Display cluster labels for a scipy dendrogram
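I'm using hierarchical clustering to cluster word vectors, and I want the user to be able to display a dendrogram showing the clusters. However, since there can be thousands of words, I want this dendrogram to be truncated to some reasonable size, with the label for each leaf being a string of the most significant words in that cluster.
My problem is that, according to the docs, "The labels[i] value is the text to put under the ith leaf node only if it corresponds to an original observation and not a non-singleton cluster." I take this to mean that I can't label clusters, only singleton points?
To illustrate, here is a short Python script that generates a simple labeled dendrogram: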
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

randomMatrix = np.random.uniform(-10, 10, size=(20, 3))
linked = linkage(randomMatrix, 'ward')

labelList = ["foo" for i in range(0, 20)]

plt.figure(figsize=(15, 12))
dendrogram(
    linked,
    orientation='right',
    labels=labelList,
    distance_sort='descending',
    show_leaf_counts=False
)
plt.show()
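Now let's say I want to truncate to just 5 leaves, and label each leaf like "foo, foo, foo...", i.e. with the words that make up that cluster. (Note: generating these labels is not the issue here.) I truncate it, and supply a label list to match: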
labelList = ["foo, foo, foo..." for i in range(0, 5)]
dendrogram(
    linked,
    orientation='right',
    p=5,
    truncate_mode='lastp',
    labels=labelList,
    distance_sort='descending',
    show_leaf_counts=False
)
And here's the problem, no labels:
I'm thinking the 'leaf_label_func' parameter might be of use here, but I'm not sure how to use it.
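You are correct about using the leaf_label_func parameter.
In addition to creating a plot, the dendrogram function returns a dictionary containing several lists (the docs call it R). The leaf_label_func you create must take a value from R["leaves"] and return the desired label. The easiest way to set the labels is to run dendrogram twice: once with no_plot=True to get the dictionary used to build your label map, and then again to create the plot.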
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

randomMatrix = np.random.uniform(-10, 10, size=(20, 3))
linked = linkage(randomMatrix, 'ward')

labels = ["A", "B", "C", "D"]
p = len(labels)

plt.figure(figsize=(8, 4))
plt.title('Hierarchical Clustering Dendrogram (truncated)', fontsize=20)
plt.xlabel('Look at my fancy labels!', fontsize=16)
plt.ylabel('distance', fontsize=16)

# call dendrogram to get the returned dictionary
# (plotting parameters can be ignored at this point)
R = dendrogram(
    linked,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=p,  # number of leaves to keep
    no_plot=True,
)

print("values passed to leaf_label_func\nleaves : ", R["leaves"])

# create a label dictionary
# (labels are assigned in the order the leaves appear in R["leaves"])
temp = {R["leaves"][ii]: labels[ii] for ii in range(len(R["leaves"]))}
def llf(xx):
    return "{} - custom label!".format(temp[xx])

## This version gives you your label AND the count
# temp = {R["leaves"][ii]: (labels[ii], R["ivl"][ii]) for ii in range(len(R["leaves"]))}
# def llf(xx):
#     return "{} - {}".format(*temp[xx])

dendrogram(
    linked,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=p,  # number of leaves to keep
    leaf_label_func=llf,
    leaf_rotation=60.,
    leaf_font_size=12.,
    show_contracted=True,  # to get a distribution impression in truncated branches
)
plt.show()
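The question asked for each truncated leaf to be labeled with the words in its cluster. One possible way to build such labels, offered only as a sketch, is to look up the members of each cluster id with scipy's to_tree and hand a function based on that to leaf_label_func. The word list, the cluster_label helper and its max_words parameter are made up for illustration; no_plot=True is used so the labels can be inspected in R["ivl"] without drawing anything.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, to_tree

rng = np.random.default_rng(0)
points = rng.uniform(-10, 10, size=(20, 3))
# Hypothetical vocabulary: one word per original observation.
words = ["word{}".format(i) for i in range(len(points))]

linked = linkage(points, "ward")

# nodes[i] is the tree node whose cluster id is i
# (ids 0..n-1 are original observations, larger ids are merged clusters).
_, nodes = to_tree(linked, rd=True)

def cluster_label(cluster_id, max_words=3):
    """Join the first few member words of the cluster with this id."""
    members = nodes[cluster_id].pre_order()  # ids of the original observations
    shown = ", ".join(words[i] for i in members[:max_words])
    return shown + ("..." if len(members) > max_words else "")

R = dendrogram(
    linked,
    truncate_mode="lastp",
    p=5,
    leaf_label_func=cluster_label,
    no_plot=True,  # drop this (and call plt.show()) to draw the figure
)
print(R["ivl"])
```

Because the function works from the cluster id itself, it does not depend on the order in which the leaves are drawn. Picking the "most significant" words instead of the first few would only change the body of cluster_label.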
You can simply write:
hierarchy.dendrogram(Z, labels=label_list)
Here is a good example using a pandas DataFrame:
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
import matplotlib.pyplot as plt
data = [[24, 16], [13, 4], [24, 11], [34, 18], [41, 6], [35, 13]]
frame = pd.DataFrame(
    np.array(data),
    columns=["Rape", "Murder"],
    index=["Atlanta", "Boston", "Chicago", "Dallas", "Denver", "Detroit"],
)
Z = hierarchy.linkage(frame, 'single')
plt.figure()
dn = hierarchy.dendrogram(Z, labels=frame.index)
In my opinion @coradek's answer is slightly wrong, although it was very helpful.
I used his code (with df as a pandas DataFrame) and corrected it:
plt.figure(figsize=(20, 10))
labelList = df.apply(lambda x: f"{x['...']}", axis=1)
Z = linkage(df[["..."]])
R = dendrogram(Z, no_plot=True)
labelDict = {leaf: labelList[leaf] for leaf in R["leaves"]}
dendrogram(Z, leaf_label_func=lambda x: labelDict[x])
plt.show()
because the code given above always gave me the labels in the same order.
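To make the difference concrete, here is a small sketch for an untruncated dendrogram. It compares a positional map (label k goes to the k-th leaf in plot order) with a map keyed by leaf id. The points and names are invented for illustration, and no_plot=True is used so the resulting labels can be read from the returned "ivl" list.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical data: names[i] belongs to points[i].
points = np.array([[0.0], [10.0], [0.5], [10.5], [5.0]])
names = ["a", "b", "c", "d", "e"]

Z = linkage(points, "single")
R = dendrogram(Z, no_plot=True)

# Positional map: the k-th plotted leaf gets names[k], whatever its id is.
by_position = {leaf: names[pos] for pos, leaf in enumerate(R["leaves"])}
# Id-based map: each leaf gets the name of the observation it represents.
by_id = {leaf: names[leaf] for leaf in R["leaves"]}

R_pos = dendrogram(Z, no_plot=True, leaf_label_func=lambda i: by_position[i])
R_id = dendrogram(Z, no_plot=True, leaf_label_func=lambda i: by_id[i])

print("leaf ids in plot order:", R["leaves"])
print("positional labels     :", R_pos["ivl"])
print("id-based labels       :", R_id["ivl"])
```

The positional version always reproduces the label list in its original order, which matches the symptom described above. Note that the id-based lookup only works as written when no truncation is used: with truncate_mode set, some leaf ids refer to merged clusters and are not valid indices into the label list.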