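Cutting a Hierarchical Dendrogram into Clusters Using SciPy in Python
In this article, we will see how to cut a hierarchical dendrogram into clusters at a threshold value using SciPy in Python.
A dendrogram is a tree diagram that shows a hierarchical clustering, that is, the relationships between similar groups of data. It is used to analyze the hierarchical relationships between different classes. The scipy.cluster package provides the tools needed for hierarchical clustering and dendrogram plotting, so it must be imported first.
Let's start by creating some sample data and plotting it as-is. We take a handful of data points as our input and will plot their dendrogram later.
Example: Sample program for creating and visualizing the data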
Python3
# Import the libraries
from scipy.cluster import hierarchy
import numpy as np
import matplotlib.pyplot as plt

# The data points are given as a 2-D array, one (x, y) pair per row
data = np.array([
    [1, 4],
    [2, 2],
    [3, 7],
    [4, 6],
    [5, 1],
    [6, 3],
    [8, 10],
    [9, 11]
])

# Transpose to split the array into x and y coordinates
x, y = data.T

# Plot the points in x-y coordinates
plt.scatter(x, y)
plt.show()
Output:
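A dendrogram is easy to plot from a linkage matrix, which is created by the linkage() function. This matrix encodes the hierarchical clustering that the dendrogram renders.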
Syntax:
hierarchy.linkage(y, method='single', metric='euclidean', optimal_ordering=False)
Parameters:
- y: a condensed 1-D distance matrix, or a 2-D array of observation vectors (one row per observation)
- method: how the distance between a newly formed cluster and the other clusters is calculated: 'single', 'complete', 'average', 'weighted', 'centroid', 'median' or 'ward'
- metric: distance metric to use when y is a collection of observation vectors
- optimal_ordering: if True, the linkage matrix is reordered so that the distance between successive leaves is minimal
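To make the linkage matrix less abstract, here is a minimal sketch that prints it for the sample data. It assumes the same eight points and the 'average' method used throughout this article. The loop variable names (left, right, height, size) are illustrative labels for the four columns, not SciPy names.

```python
import numpy as np
from scipy.cluster import hierarchy

# Same sample points as in the rest of the article
data = np.array([[1, 4], [2, 2], [3, 7], [4, 6],
                 [5, 1], [6, 3], [8, 10], [9, 11]])

Z = hierarchy.linkage(data, method='average', metric='euclidean')

# n observations produce n - 1 merges, one row per merge
print(Z.shape)

# Columns: index of the first cluster, index of the second cluster,
# the distance at which they merge, and the size of the new cluster
for left, right, height, size in Z:
    print(int(left), int(right), round(float(height), 3), int(size))
```

The third column holds the merge heights, which are the values shown on the vertical axis of the dendrogram.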
Example: Creating a dendrogram for our data
Python3
# Create the linkage matrix Z for our data
Z = hierarchy.linkage(data, method='average')

plt.figure()
plt.title("Dendrograms")

# Plot the dendrogram from the linkage matrix
# (named dn so it does not shadow the dendrogram function)
dn = hierarchy.dendrogram(Z)
plt.show()
Output:
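Now let's cut the dendrogram at a threshold. We choose a cutoff (threshold) value of 4 and draw a horizontal line across the dendrogram at that height.
For any chosen cutoff, the number of clusters is simply the number of vertical dendrogram lines that the cutoff line crosses. With max_d = 4 the line crosses four vertical lines, giving 4 clusters; if we chose a cutoff of max_d = 6 instead, we would get 2 final clusters.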
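The same count can be obtained without reading the plot. This is a minimal sketch, assuming a linkage method whose merge heights never decrease (such as 'average'): every merge at or below the cutoff reduces the number of clusters by one. The helper count_clusters is a hypothetical name introduced here for illustration and is not part of SciPy.

```python
import numpy as np
from scipy.cluster import hierarchy

# Same sample points as in the rest of the article
data = np.array([[1, 4], [2, 2], [3, 7], [4, 6],
                 [5, 1], [6, 3], [8, 10], [9, 11]])
Z = hierarchy.linkage(data, method='average')


def count_clusters(Z, max_d):
    """Hypothetical helper: clusters remaining after a cut at height max_d."""
    n_points = Z.shape[0] + 1                        # n points give n - 1 merges
    merges_below_cut = int(np.sum(Z[:, 2] <= max_d)) # column 2 = merge heights
    return n_points - merges_below_cut


print(count_clusters(Z, 4))
print(count_clusters(Z, 6))
```

For the sample data this gives 4 clusters at a cutoff of 4 and 2 clusters at a cutoff of 6, matching the count read from the dendrogram.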
Example: Cutting the dendrogram at a threshold
Python3
# max_d = cutoff / threshold value
max_d = 4

Z = hierarchy.linkage(data, method='average')

plt.figure()
plt.title("Dendrograms")
dn = hierarchy.dendrogram(Z)

# Mark the cut with a horizontal line at height max_d
plt.axhline(y=max_d, c='k')
plt.show()
Output:
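The line drawn above only marks the cut visually. To get the actual cluster assignment for each point, one option is SciPy's fcluster function with criterion='distance'. This is a sketch of one way to do it, assuming the same data, the same 'average' linkage and the same cutoff of 4. It is not necessarily the only approach.

```python
import numpy as np
from scipy.cluster import hierarchy

# Same sample points as in the rest of the article
data = np.array([[1, 4], [2, 2], [3, 7], [4, 6],
                 [5, 1], [6, 3], [8, 10], [9, 11]])
Z = hierarchy.linkage(data, method='average')
max_d = 4

# criterion='distance' groups points whose cophenetic distance
# (the height at which they are first merged) is at most max_d
labels = hierarchy.fcluster(Z, t=max_d, criterion='distance')

# One integer label per input point, in the same order as data
for point, label in zip(data, labels):
    print(point, '-> cluster', label)

print('Number of clusters:', len(np.unique(labels)))
```

Points that share a label belong to the same branch below the cutoff line. Raising max_d merges branches and so yields fewer, larger clusters.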