How do I plug distance data into scipy's agglomerative clustering methods?

Question

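So, I have a set of texts I'd like to do some clustering analysis on. I've taken the Normalized Compression Distance between every pair of texts, and now I've basically built a complete graph with weighted edges that looks something like this: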

text1, text2, 0.539
text2, text3, 0.675

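I'm having trouble figuring out the best way to plug this data into scipy's hierarchical clustering methods. I can probably convert the distance data into a table like the one on this page. How can I format this data so that it can easily be plugged into scipy's HAC code?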

Answer 1

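You're on the right track with converting the data into a table like the one on the linked page (a redundant distance matrix). According to the documentation, you should be able to pass that directly into scipy.cluster.hierarchy.linkage or a related function, such as scipy.cluster.hierarchy.single or scipy.cluster.hierarchy.complete. The related functions explicitly specify how the distance between clusters should be calculated. scipy.cluster.hierarchy.linkage lets you specify whichever method you want, but defaults to single linkage (i.e. the distance between two clusters is the distance between their closest points). All of these methods return a multidimensional array representing the agglomerative clustering. You can then use the rest of the scipy.cluster.hierarchy module to perform various actions on that clustering, such as visualizing or flattening it.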
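As a rough sketch of the table-building step, assuming the distances arrive as (name, name, distance) triples like the ones in the question: the helper name edges_to_square and the third pair (text1, text3, 0.600) are made up here for illustration, the latter only so that the example graph is complete.

```python
import numpy as np

# Hypothetical edge list in the question's "name, name, distance" form.
# The text1/text3 distance is invented so that every pair is present.
edges = [
    ("text1", "text2", 0.539),
    ("text2", "text3", 0.675),
    ("text1", "text3", 0.600),
]


def edges_to_square(edges):
    """Build a symmetric (redundant) distance matrix from weighted edges."""
    # Fix an ordering of the texts so that row/column i always means labels[i].
    labels = sorted({name for a, b, _ in edges for name in (a, b)})
    index = {name: i for i, name in enumerate(labels)}

    # The diagonal stays at zero: each text is at distance 0 from itself.
    dist = np.zeros((len(labels), len(labels)))
    for a, b, d in edges:
        dist[index[a], index[b]] = d
        dist[index[b], index[a]] = d  # mirror the entry to keep it symmetric
    return labels, dist


labels, dist = edges_to_square(edges)
```

Keeping labels alongside dist matters later, because the clustering output refers to the texts only by their row index.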

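However, there's a catch. As of the time this question was written, you couldn't actually use a redundant distance matrix, despite the fact that the documentation says you can. Based on the fact that the github issue is still open, I don't think this has been resolved yet. As pointed out in the answers to the linked question, you can work around this by passing the complete distance matrix to the scipy.spatial.distance.squareform function, which converts it into the format that is actually accepted (a flat array containing the upper-triangular portion of the distance matrix, called a condensed distance matrix). You can then pass the result to one of the scipy.cluster.hierarchy functions.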
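A minimal sketch of that workaround, assuming a small symmetric matrix with a zero diagonal. The distance values and the 0.55 cut-off are illustrative only (0.600 is not from the question), and fcluster is just one of several ways to flatten the result.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Redundant (square) distance matrix for text1, text2, text3.
# Illustrative values; it must be symmetric with zeros on the diagonal.
dist = np.array([
    [0.000, 0.539, 0.600],
    [0.539, 0.000, 0.675],
    [0.600, 0.675, 0.000],
])

# squareform turns the square matrix into the condensed form: a flat array
# holding the upper triangle row by row, i.e. pairs (0,1), (0,2), (1,2).
condensed = squareform(dist)

# Pass the condensed distances to linkage; "single" is also the default method.
Z = linkage(condensed, method="single")

# Flatten the hierarchy by cutting it at an (arbitrary) distance threshold.
flat = fcluster(Z, t=0.55, criterion="distance")
```

Each row of Z records one merge: the two cluster indices joined, the distance at which they were joined, and the size of the new cluster. With the threshold above, text1 and text2 end up in the same flat cluster and text3 stays on its own.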


numpy

machine-learning

hierarchical-clustering

scipy