ML | K-means++ Algorithm
Prerequisite: K-means Clustering - Introduction
Drawback of the standard K-means algorithm:
One drawback of the K-means algorithm is that it is sensitive to the initialization of the centroids (the mean points). If a centroid is initialized as a "far-off" point, it may end up with no points associated with it, while more than one cluster may end up tied to a single centroid. Similarly, more than one centroid may be initialized into the same cluster, resulting in poor clustering. For example, consider the images shown below.
Poor initialization of the centroids resulted in poor clustering.
This is how the clustering should have been:
K-means++:
To overcome the above drawback we use K-means++. This algorithm ensures a smarter initialization of the centroids and improves the quality of the clustering. Apart from initialization, the rest of the algorithm is the same as the standard K-means algorithm. That is, K-means++ is the standard K-means algorithm coupled with a smarter initialization of the centroids.
Initialization algorithm:
The steps involved are:
- Randomly select the first centroid from the data points.
- For each data point, compute its distance from the nearest previously chosen centroid.
- Select the next centroid from the data points such that the probability of choosing a point as a centroid is directly proportional to its squared distance from the nearest previously chosen centroid. (In other words, the point farthest from the already-chosen centroids is the most likely to be selected next; the implementation below simplifies this to always picking the farthest point, and a sketch of the weighted-sampling variant follows this list.)
- Repeat steps 2 and 3 until k centroids have been sampled.
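In the canonical K-means++ selection (step 3), the next centroid is drawn at random with probability proportional to the squared distance D(x)² of each point from its nearest chosen centroid, rather than deterministically taking the farthest point. A minimal sketch of that weighted-sampling step is given below; the helper name sample_next_centroid and its signature are illustrative and not part of the implementation further down.

Python3

import numpy as np

def sample_next_centroid(data, centroids, rng=None):
    # weighted D^2 sampling step of K-means++ (illustrative helper)
    rng = np.random.default_rng() if rng is None else rng
    # squared distance of each point to its nearest already-chosen centroid
    d2 = np.array([min(np.sum((x - c) ** 2) for c in centroids)
                   for x in data])
    # pick the next centroid with probability proportional to D(x)^2
    return data[rng.choice(len(data), p=d2 / d2.sum())]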
Intuition:
By following the above procedure for initialization, we pick centroids that are far away from one another. This increases the chance of initially picking centroids that lie in different clusters. Also, since the centroids are picked from the data points, each centroid ends up with some data points associated with it.
Implementation:
Consider a dataset with the following distribution:
Code: Python code for the K-means++ initialization algorithm
Python3
# importing dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
# creating data
mean_01 = np.array([0.0, 0.0])
cov_01 = np.array([[1, 0.3], [0.3, 1]])
dist_01 = np.random.multivariate_normal(mean_01, cov_01, 100)
mean_02 = np.array([6.0, 7.0])
cov_02 = np.array([[1.5, 0.3], [0.3, 1]])
dist_02 = np.random.multivariate_normal(mean_02, cov_02, 100)
mean_03 = np.array([7.0, -5.0])
cov_03 = np.array([[1.2, 0.5], [0.5, 1.3]])
dist_03 = np.random.multivariate_normal(mean_03, cov_03, 100)
mean_04 = np.array([2.0, -7.0])
cov_04 = np.array([[1.2, 0.5], [0.5, 1.3]])
dist_04 = np.random.multivariate_normal(mean_04, cov_04, 100)
data = np.vstack((dist_01, dist_02, dist_03, dist_04))
np.random.shuffle(data)
# function to plot the selected centroids
def plot(data, centroids):
    plt.scatter(data[:, 0], data[:, 1], marker='.',
                color='gray', label='data points')
    plt.scatter(centroids[:-1, 0], centroids[:-1, 1],
                color='black', label='previously selected centroids')
    plt.scatter(centroids[-1, 0], centroids[-1, 1],
                color='red', label='next centroid')
    plt.title('Select %d-th centroid' % (centroids.shape[0]))
    plt.legend()
    plt.xlim(-5, 12)
    plt.ylim(-10, 15)
    plt.show()
# function to compute the squared Euclidean distance
def distance(p1, p2):
    return np.sum((p1 - p2)**2)
# initialization algorithm
def initialize(data, k):
    '''
    Initialize the centroids for K-means++.
    inputs:
        data - numpy array of data points having shape (400, 2)
        k - number of clusters
    '''
    ## initialize the centroids list and add
    ## a randomly selected data point to the list
    centroids = []
    centroids.append(data[np.random.randint(data.shape[0]), :])
    plot(data, np.array(centroids))

    ## compute the remaining k - 1 centroids
    for c_id in range(k - 1):

        ## for each data point, store its squared distance
        ## from the nearest previously chosen centroid
        dist = []
        for i in range(data.shape[0]):
            point = data[i, :]
            d = sys.maxsize

            ## compute the distance of 'point' from each previously
            ## selected centroid and keep the minimum
            for j in range(len(centroids)):
                temp_dist = distance(point, centroids[j])
                d = min(d, temp_dist)
            dist.append(d)

        ## select the data point with maximum distance as the next centroid
        dist = np.array(dist)
        next_centroid = data[np.argmax(dist), :]
        centroids.append(next_centroid)
        plot(data, np.array(centroids))
    return centroids

# call the initialize function to get the centroids
centroids = initialize(data, k = 4)
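These initialized centroids would normally seed a full K-means run rather than serve as the final answer. As a usage sketch (assuming scikit-learn is installed; its KMeans accepts a (k, n_features) array for the init parameter, and n_init=1 is appropriate because the initialization is already fixed):

Python3

# seed scikit-learn's KMeans with the centroids computed above
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, init=np.array(centroids), n_init=1)
kmeans.fit(data)
print(kmeans.cluster_centers_)  # refined centroids after the Lloyd iterations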
Output:
Note: Although the initialization in K-means++ is computationally more expensive than that of the standard K-means algorithm, the runtime for convergence to the optimum is drastically reduced for K-means++. This is because the initially chosen centroids are likely to lie in different clusters already.