Top 7 Clustering Algorithms Every Data Scientist Should Know
Clustering is the process of grouping data points based on various measures of similarity or dissimilarity between them. It is widely used in machine learning and data science and is generally considered an unsupervised learning method. A range of standard clustering algorithms is then used to group these data points. Depending on the clustering requirements, the clusters formed from the input data points are kept separate, and this is where the data scientist's real work begins: choosing a clustering algorithm so that the available dataset is well represented in the form of clusters.
At the same time, if you are an aspiring data scientist or want to land a well-regarded position in the data science market, you must be familiar with the top clustering algorithms. In this article, we discuss the 7 clustering algorithms that every budding data scientist should know:
1. Agglomerative Hierarchical Clustering
Hierarchical clustering is common in our daily lives, and we often overlook it even though it produces nested sequences of clusters. Such clusters are arranged through either a top-down or a bottom-up approach. Top-down means viewing the dataset from the source down to its subsets, like father, child, and grandchild, while bottom-up views the dataset from the individual subsets back up to the source. The bottom-up approach is in fact agglomerative hierarchical clustering, in which the individual data points are clustered into multiple data pairs.
These data pairs, which are clusters in their own right, are then merged until a single large cluster consisting of all data points is obtained. Wondering what the tree-like visual representation of this hierarchical clustering is called? It is called a dendrogram. Now let's go through the steps for implementing agglomerative hierarchical clustering successfully.
Algorithm:
- Each data point starts out as its own cluster; let's assume the total number of clusters is m.
- Now, set up an m×m proximity/distance matrix that records the distance between every pair of data points that could participate in forming a cluster.
- Next, you will find one or more data pairs with common similarities. Merge the pair that is most similar (smallest distance) into a single cluster and then keep updating the distance matrix.
- To measure the distance between the merged clusters, you can use any of these linkage techniques: single-link, complete-link, average-link, or centroid.
- Keep updating the matrix through any of the distance-measuring techniques mentioned in the step above until you reach the source, where a single cluster consisting of all objects is left.
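A minimal sketch of the steps above, using scikit-learn's AgglomerativeClustering; the toy dataset, the cluster count, and the average-linkage choice are illustrative assumptions rather than values from the article.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy 2-D dataset; in practice this would be your own feature matrix.
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Bottom-up merging: each point starts as its own cluster and the closest
# pair (by average linkage here) is merged until n_clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels)  # cluster index assigned to each data point
```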
2. Balanced Iterative Reducing and Clustering Using Hierarchies
Balanced Iterative Reducing and Clustering using Hierarchies, known in the market as BIRCH, is one of the best unsupervised clustering algorithms. It is a four-phase algorithm that efficiently captures useful patterns in the data with the help of an appropriate hierarchical structure, so that large databases containing multi-dimensional data points can be managed without compromising cluster quality.
Wondering whether the algorithm is constrained by limits such as time and memory? Without hesitation, the algorithm does have some limitations, but it can still find the best-quality clusters with a single scan of the database. Here is a brief description of the four phases of BIRCH:
- Phase One: This is one of the most important phases and starts with the idea of building a CF, or Clustering Feature, tree. There are several steps within this phase, the first being:
- A CF is represented as a three-dimensional vector of the form CF = (N, LS, SS), where N is the number of data points in the sub-cluster, LS is their linear sum, and SS is their squared sum.
- There will be many CFs like the one above, which are iteratively arranged into a height-balanced tree called the CF tree. You will find two parameters of this tree:
- Branching Factor (which limits the maximum number of children/CF entries a node can accommodate)
- Threshold (the maximum diameter allowed among the data points within a sub-cluster stored at a leaf node). Furthermore, there are other parameters such as T (the size of the CF tree) and P (the page size into which a leaf or non-leaf node must fit).
- Now, you might be wondering how this CF tree is represented hierarchically. Non-leaf nodes are stored as entries [CF_i, child_i], where child_i is a pointer to its ith child node and CF_i is the clustering feature summarizing the associated sub-cluster.
- At the end of this phase, the CF tree has been built, so we can move on to the next phase, which scans the CF tree.
- Phase Two: This phase can be described as the data condensation or CF-tree resizing phase. Although it is marked optional in the original presentation of BIRCH, it is still very important, because through this phase the CF tree can be rebuilt into a smaller tree by:
- Grouping densely crowded sub-clusters into larger clusters comprising multiple data points stored in the tree as (leaf or non-leaf) nodes.
- Removing entries with abnormal diameter values so that data condensation can proceed smoothly.
- Phase Three: This phase is another name for global clustering. Here, any existing clustering algorithm (such as K-Means) is applied to cluster all the leaf-node entries of the CF tree. Why apply an off-the-shelf global clustering algorithm? Any of them gives you the flexibility to specify the desired number of clusters and the diameter threshold necessary for quality clusters.
- Phase Four: The last, or fourth, phase is also called cluster refinement. After obtaining the set of clusters from Phase Three, the data points are redistributed to their closest seeds (according to the centroid principle) so that a better version of the clusters is obtained for handling databases with larger datasets. Finally, duplicate or abnormal leaf or non-leaf values are identified and removed so that the improved clusters can be loaded into memory.
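The sketch below uses scikit-learn's Birch estimator, whose threshold and branching_factor parameters correspond to the Threshold and Branching Factor above and whose n_clusters argument plugs a global clustering step (Phase Three) onto the CF-tree leaf entries; the parameter values and toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import Birch, KMeans

X = np.random.RandomState(0).rand(500, 3)  # toy multi-dimensional data points

# threshold bounds the diameter of a leaf sub-cluster; branching_factor bounds
# how many CF entries a node may hold; n_clusters applies a global clustering
# step on top of the CF-tree leaf entries.
birch = Birch(threshold=0.2, branching_factor=50,
              n_clusters=KMeans(n_clusters=4, n_init=10, random_state=0))
labels = birch.fit_predict(X)
print(len(birch.subcluster_centers_), "leaf sub-clusters")
```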
3. EM Clustering
Known in the data science field as the solution that overcomes the shortcomings of K-Means, the EM, or Expectation-Maximization, clustering algorithm uses Gaussian functions to intuitively estimate missing values from the available dataset. It then shapes the clusters through optimized values of the mean and standard deviation.
This whole process of estimation and optimization is carried out until clusters are obtained whose parameters fit the observed data with the highest likelihood. Now let's understand the steps of the EM clustering algorithm:
- Start by considering a set of parameters chosen randomly for the recorded observations. The prime purpose of selecting the parameters randomly is to quickly obtain initial data clusters on which the estimation and maximization steps will be performed.
- The next step is known as Estimation. Here, the data clusters formed are observed in such a way that the missing values can be estimated through a probability distribution function (a Gaussian mixture model provides such a distribution, keeping in mind the maximum likelihood of the estimated values).
- Now it is time to perform the optimization (maximization) step via the probability distribution function, by computing parameters such as the mean and standard deviation of the datasets that are likely to be much closer to a selected cluster.
- At last, convergence, which is a plain programming check, is given attention; its condition is met after steps 2 and 3 are performed iteratively. The datasets used in the estimation and optimization (maximization) steps are probabilistically cross-checked until the difference between the likelihoods of successive iterations is negligible or almost zero. If required, the calculations of estimated and expected values may be repeated until the point of convergence is actually reached. As soon as this point is identified, the algorithm (working on the use-observe-update principle) can be halted, and you can enjoy accurate results with all inconsistencies removed.
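A minimal sketch of EM clustering through scikit-learn's GaussianMixture, whose fit alternates the estimation and maximization steps until the change in log-likelihood falls below a tolerance; the component count, tolerance, and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic Gaussian blobs standing in for real observations.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])

# fit() alternates E and M steps until the log-likelihood improvement drops
# below tol (convergence) or max_iter is reached.
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      tol=1e-4, max_iter=200, random_state=0)
labels = gmm.fit_predict(X)
print(gmm.means_)       # per-cluster means estimated in the M step
print(gmm.converged_)   # True once the likelihood change is negligible
```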
4. Hierarchical Clustering
Hierarchical clustering algorithms sometimes work like magic: your task is to identify the data elements and map them according to the possibility of clustering. The data elements mapped after this comparison may belong to a cluster whose attributes differ in terms of the quantitative relationships between the data variables, whether in multi-dimensional scaling, cross-tabulations, or across multiple factors.
Wondering how a single cluster is identified after merging the available clusters, while keeping in mind the hierarchy of the features by which they are classified? For that, take a look at the steps of the hierarchical clustering algorithm written below:
- Start by selecting the data points and mapping them as clusters in accordance with the hierarchy.
- Wondering how the clusters will be interpreted? A dendrogram can be used to interpret the hierarchy of clusters with either a top-down or a bottom-up approach.
- The mapped clusters are merged until a single cluster is left; to measure the closeness between clusters while merging them, we may use metrics such as Euclidean distance, Manhattan distance, or Mahalanobis distance.
- The algorithm then terminates, since the final merge point has been identified and mapped on the dendrogram.
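A brief sketch of these steps with SciPy, which builds the merge hierarchy and draws the dendrogram mentioned above; the random sample, the average-linkage method, and the Manhattan (cityblock) metric are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.RandomState(42).rand(12, 2)  # toy data points

# Build the merge hierarchy bottom-up; 'cityblock' is the Manhattan distance,
# one of the closeness metrics listed above.
Z = linkage(X, method="average", metric="cityblock")

dendrogram(Z)   # tree view of the nested clusters
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
```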
5. Density-Based Spatial Clustering
When it comes to identifying clusters, the Density-Based Spatial Clustering of Applications with Noise algorithm (or DBSCAN) is a better choice than K-Means, as it simply cross-checks the density of the data points in a larger spatial database. Moreover, it is more appealing than, and about 100 times as efficient as, CLARANS (Clustering LARge ApplicatioNS through a medoid-based partitioning method). Thanks to its density-based notion of cluster identification, it has earned awards for receiving wide attention in both practice and theory.
Wondering what basic concept this algorithm uses? This award-winning spatial data clustering algorithm picks an arbitrary point and then identifies the other points in the neighborhood of that arbitrary point. Later, the data points identified with the help of the arbitrary point are recognized as a cluster, while a point far away from the arbitrary point (called noise/an outlier) is used in further iterations of cluster identification. Let's understand the steps of this award-winning algorithm more clearly:
- Begin by considering a large spatial database from which clusters of arbitrary shape are to be discovered. Within that space, select an arbitrary point, say p, and then proceed to find its nearest neighboring data points, such as q, via the distance parameter ε.
- More data points (like q) can now be identified until a cluster of arbitrary shape and density has been approximately identified. The number of those data points matters, since their clustering starts only with some value, say 5 or more, of minPts (the minimum number of points required to form a density-based cluster). (Note: all points of a cluster are mutually density-connected. A selected point is part of a cluster if it is density-reachable from some point already in the cluster.)
- It is quite possible that unreachable points are encountered during the clustering process. Instead of discarding them, they can be labeled as noise/outliers.
- Repeat the above steps (i.e., 2 and 3) so that the data points examined become part of a cluster with some shape and density; the points labeled as noise will be visited later.
- In the end, the noise or outliers are visited to identify their neighboring data points, which may form clusters lying in low-density regions. (Note: it is not mandatory to traverse the outliers, as they are visible in low-density regions.)
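A compact sketch with scikit-learn's DBSCAN, where eps corresponds to the ε distance and min_samples to minPts from the steps above; the parameter values and the two-moons dataset are illustrative assumptions.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Arbitrary-shaped clusters, the kind DBSCAN handles better than K-Means.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps plays the role of ε (neighborhood radius); min_samples plays minPts.
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

print(set(labels))  # cluster indices; -1 marks points labeled as noise/outliers
```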
6. K-Means Clustering
The K-Means clustering algorithm iteratively identifies k clusters after computing the distance between each data point and the cluster centroids. With its vector-quantization view of the observations, computing the cluster centroids is very advantageous, since through these centroids data points with varying features can be brought into clusters.
With the clustering process sped up in this way, a lot of unlabeled real-world data becomes relatively usable, since it is now segmented into clusters of differing shape and density. Wondering how the centroid distance is computed? Take a look at the K-Means steps listed below:
- First, select the number of clusters, which may vary in shape and density. Let's call that number k, whose value you can choose as 3, 4, or anything else.
- Now, assign the data points to the numbered clusters. Then, for the selected data point and cluster, the centroid distance is computed through the least-squared Euclidean distance.
- If the data point is close to the centroid, it belongs to that cluster; otherwise it does not.
- Keep computing the centroid distances iteratively for the selected data points until you have identified the k clusters of similar data points. The algorithm stops its clustering process as soon as guaranteed convergence (a point where the data points are clustered well and the centroids no longer move) is achieved.
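A minimal sketch of these steps with scikit-learn's KMeans; the value of k and the toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(7).rand(300, 2)  # toy unlabeled data points

# k = 3 clusters; each point is assigned to the centroid with the smallest
# squared Euclidean distance, and centroids are recomputed until they stop moving.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centroid of each of the k clusters
print(kmeans.inertia_)          # sum of squared distances to the nearest centroid
```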
7. Ordering Points To Identify the Clustering Structure
OPTICS, or Ordering Points To Identify the Clustering Structure, has the potential to improve database cataloging. You may be wondering what database cataloging is! Database cataloging is a way of ordering the list of a database, which consists of the datasets located in the clusters.
These clusters have varying densities and shapes, so their structures also vary. Moreover, the basic approach of OPTICS is similar to the density-based spatial clustering algorithm (already discussed in point 5), but at the same time it addresses many of DBSCAN's weaknesses.
The main reason for detecting and addressing DBSCAN's weaknesses is that you no longer have to worry about identifying clusters of varying density, which is not something DBSCAN handles. Want to see how this algorithm works? Just read the steps below:
- To begin with, a set of unclassified data points can be reviewed, since there is now no need to specify the number of clusters. Then, select some arbitrary point, say p, and start searching within the distance parameter ε to find its neighborhood points.
- To proceed with the clustering process, it is essential to find the minimum number of data points with which a densely populated cluster can be formed; that number is denoted by the variable minPts. Here, the expansion from a point may stop if fewer than minPts neighbors are found within ε.
- Keep updating the reachability values of the ε-neighborhoods and the current data point until the clusters of different densities are segmented well, even better than with DBSCAN.
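A short sketch with scikit-learn's OPTICS, which orders the points by reachability instead of requiring a fixed number of clusters; the min_samples value and the mixed-density blobs are illustrative assumptions.

```python
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Two blobs with very different densities, the case DBSCAN struggles with.
X, _ = make_blobs(n_samples=[200, 200], centers=[[0, 0], [6, 6]],
                  cluster_std=[0.3, 1.5], random_state=0)

# min_samples plays the role of minPts; no number of clusters is specified.
optics = OPTICS(min_samples=5)
labels = optics.fit_predict(X)

print(optics.reachability_[optics.ordering_][:10])  # reachability plot values
print(set(labels))  # cluster indices; -1 marks noise
```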