Clustering is an unsupervised machine learning technique that groups the data points of a dataset into clusters based on the similarity of the information available for those points. Data points belonging to the same cluster are similar to each other in some way, while data points belonging to different clusters are dissimilar.
K-means and DBSCAN (Density-Based Spatial Clustering of Applications with Noise) are two of the most popular clustering algorithms in unsupervised machine learning.
1. K-Means Clustering:
K-means is a centroid-based (partition-based) clustering algorithm. The algorithm partitions all points in the sample space into K groups of similar points. Similarity is usually measured by Euclidean distance.
The algorithm is as follows:
- K centroids are placed at random, one for each cluster.
- The distance from each point to each centroid is computed.
- Each data point is assigned to its nearest centroid, forming a cluster.
- The positions of the K centroids are recomputed.
- The previous three steps are repeated until the centroids no longer move (convergence).
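The steps above can be sketched in a few lines of NumPy; this is a minimal illustration, not a production implementation (function and variable names are our own):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means following the steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: place the K centroids at randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to its nearest centroid.
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of points:
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2)
```

Note that the result depends on the random initial placement of the centroids, which is why practical implementations restart the algorithm several times and keep the best run.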
2. DBSCAN Clustering:
DBSCAN is a density-based clustering algorithm. The key idea of the algorithm is that, for every point of a cluster, the neighborhood within a given radius (R) must contain at least a minimum number of points (M). In practice, the algorithm has proven very effective at detecting outliers and handling noise.
The algorithm is as follows:
- Determine the type of each point. Every data point in our dataset may be one of the following:
  - Core point: a data point is a CORE point if its neighborhood (i.e. the area within the specified radius R) contains at least M points.
  - Border point: a data point is classified as a BORDER point if:
    - its neighborhood contains fewer than M data points, and
    - it is reachable from some core point, i.e. it lies within distance R of a core point.
  - Outlier: an outlier is a point that is not a core point and is not close enough to any core point.
- Outliers are discarded.
- Core points that are neighbors are connected and placed in the same cluster.
- Each border point is assigned to the cluster of a core point in its neighborhood.
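The point-classification step above can be sketched directly from the three rules; this is an illustrative NumPy fragment (the function name and demo data are our own), not a full DBSCAN implementation:

```python
import numpy as np

def classify_points(X, R, M):
    """Label every point CORE, BORDER, or OUTLIER per the rules above."""
    # Pairwise Euclidean distance matrix.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighborhood size (a point counts as its own neighbor here).
    counts = (d <= R).sum(axis=1)
    core = counts >= M
    # BORDER: not a core point, but within R of at least one core point.
    border = ~core & (d[:, core] <= R).any(axis=1)
    labels = np.where(core, "CORE", np.where(border, "BORDER", "OUTLIER"))
    return labels

# A dense group, one point hanging off its edge, and a far-away point.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [0.6, 0.0], [20.0, 20.0]])
labels = classify_points(X, R=0.5, M=3)
```

After this classification, the remaining work is only to connect neighboring core points into clusters and attach the border points to them.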
There are some notable differences between K-means and DBSCAN:
| S.No. | K-means Clustering | DBSCAN Clustering |
|---|---|---|
| 1. | Clusters formed are more or less spherical or convex in shape and must be of comparable size. | Clusters formed are arbitrary in shape and need not be of comparable size. |
| 2. | K-means clustering is sensitive to the number of clusters specified. | The number of clusters need not be specified. |
| 3. | K-means clustering is more efficient for large datasets. | DBSCAN clustering cannot efficiently handle high-dimensional datasets. |
| 4. | K-means clustering does not work well with outliers and noisy datasets. | DBSCAN clustering handles outliers and noisy datasets efficiently. |
| 5. | In the domain of anomaly detection, this algorithm causes problems, as anomalous points are assigned to the same cluster as "normal" data points. | DBSCAN, on the other hand, locates regions of high density that are separated from one another by regions of low density. |
| 6. | It requires one parameter: the number of clusters (K). | It requires two parameters: a radius (R) and a minimum number of points (M). R is a chosen radius such that a neighborhood containing enough points within it counts as a dense region; M is the minimum number of data points a neighborhood must contain to be defined as a cluster. |
| 7. | Varying densities of the data points do not affect the K-means clustering algorithm. | DBSCAN clustering does not work very well for sparse datasets or for data points with varying densities. |
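The contrast in rows 1 and 4 is easy to see on a toy dataset. The sketch below assumes scikit-learn is available and uses its `KMeans` and `DBSCAN` implementations on the classic two-moons shape, which is non-convex:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: clusters that are not spherical.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# K-means splits the plane with a straight boundary between 2 centroids.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN follows the density, recovering each moon as one
# arbitrarily-shaped cluster; any noise points are labelled -1.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
```

Plotting `km_labels` against `db_labels` makes the difference concrete: K-means cuts across both moons, while DBSCAN assigns each moon to its own cluster.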