How to do a data correlation clustering plot in Python

Question

I have a database that contains information about the commits made to a repository. For example

commit-sha1 | file1 | 
commit-sha1 | file2 |
commit-sha2 | file2 |
commit-sha2 | file3 |

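and so on. Basically, this shows that sha1 changed the files (file1, file2) and sha2 changed (file2, file3). Now I want to see whether certain files are correlated with each other, i.e. how likely it is that file1 and file2 are committed together, and so on. To do this, I first found the 50 most frequently committed files, which gave me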

file1 - 1500
file2 - 1423
file3 - 1222..

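For each file f, compute P(f) = (commits containing f) / (total commits).
For each pair of files f1, f2, compute Q(f1, f2) = (commits containing both f1 and f2) / (total commits).
For each pair of files f1, f2, compute D(f1, f2) = P(f1) * P(f2) / [Q(f1, f2) - P(f1) * P(f2)], or infinity if Q(f1, f2) <= P(f1) * P(f2).

After doing the above, I now have pairs of files and their D(f1, f2) values, which look like this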

two_pair_list = [['file1', 'file2'], ['file1', 'file3']...['file49', 'file50']]

d_value = [3.2, -1, 0.12, 7.6, -1, ...]
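For concreteness, the P, Q and D computation described above could be sketched roughly as follows. This is only an illustration under assumptions: the `rows` sample data and the `pairwise_d` helper are hypothetical names that do not come from the actual database, the sketch uses every file rather than only the top 50, and -1 stands in for infinity.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample rows in the (commit, file) shape shown above.
rows = [
    ("sha1", "file1"), ("sha1", "file2"),
    ("sha2", "file2"), ("sha2", "file3"),
    ("sha3", "file1"), ("sha3", "file2"),
    ("sha4", "file3"),
]

def pairwise_d(rows):
    """Return (files, two_pair_list, d_value) for every pair of files."""
    files_by_commit = defaultdict(set)
    for sha, fname in rows:
        files_by_commit[sha].add(fname)
    total = len(files_by_commit)            # total number of commits
    files = sorted({fname for _, fname in rows})

    single = defaultdict(int)               # commits containing f
    joint = defaultdict(int)                # commits containing both f1 and f2
    for changed in files_by_commit.values():
        for f in changed:
            single[f] += 1
        for pair in combinations(sorted(changed), 2):
            joint[pair] += 1

    two_pair_list, d_value = [], []
    for f1, f2 in combinations(files, 2):
        p = (single[f1] / total) * (single[f2] / total)   # P(f1) * P(f2)
        q = joint[(f1, f2)] / total                       # Q(f1, f2)
        two_pair_list.append([f1, f2])
        # -1 is the sentinel for "infinity" when Q <= P(f1) * P(f2)
        d_value.append(p / (q - p) if q > p else -1)
    return files, two_pair_list, d_value

files, two_pair_list, d_value = pairwise_d(rows)
```

Because the pairs come from `itertools.combinations` over the sorted file names, `d_value` is already in the order (0,1), (0,2), ..., (1,2), ..., which is the order a condensed distance vector uses.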

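I set the d_value to -1 when Q(f1, f2) <= P(f1) * P(f2). For example, since no commit in the database contains both file1 and file3 (i.e. Q(file1, file3) = 0), its d_value is -1. Now, given this list of d_values for the file pairs, how do I perform hierarchical clustering to see which files are related? I believe Python's linkage() API would help, but I'm not sure how to use it with this data. Any help is appreciated, thanks.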

Answer 1

A simple example:

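from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np
from matplotlib import pyplot as plt

# A 1-D input is treated as a condensed distance matrix: 6 values = 4*3/2,
# i.e. the pairwise distances of 4 files in the order
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
d_value = np.array([3.2, 100, 0.12, 7.6, 100, 5.2])
Z = linkage(d_value, 'ward')
fig = plt.figure()
dn = dendrogram(Z)
plt.show()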

Result: [dendrogram figure]

Note that I changed your -1 to 100, because when file1 and file3 are never committed together, their distance should be large.
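To go from the `two_pair_list` / `d_value` layout in the question to something `linkage()` accepts, one option is to fill a square matrix by file name and let `squareform` produce the condensed vector, so the pair order can never get out of sync. The following is a minimal sketch under assumptions: the four file names and six values are made-up sample data, `far` is an arbitrary choice of "large distance" for the -1 entries, and `average` linkage is swapped in here because `ward` is only correctly defined for Euclidean distances, which D is not guaranteed to be.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical sample data in the same shape as the question's lists.
files = ["file1", "file2", "file3", "file4"]
two_pair_list = [["file1", "file2"], ["file1", "file3"], ["file1", "file4"],
                 ["file2", "file3"], ["file2", "file4"], ["file3", "file4"]]
d_value = [3.2, -1, 0.12, 7.6, -1, 5.2]

index = {name: i for i, name in enumerate(files)}
d = np.asarray(d_value, dtype=float)

# Assumption: treat -1 ("never committed together more than chance") as a
# distance well beyond every finite one; the factor 10 is arbitrary.
far = 10 * d[d >= 0].max()

# Fill a symmetric distance matrix with a zero diagonal.
n = len(files)
dist = np.zeros((n, n))
for (f1, f2), value in zip(two_pair_list, d_value):
    i, j = index[f1], index[f2]
    dist[i, j] = dist[j, i] = far if value < 0 else value

condensed = squareform(dist)               # square matrix -> condensed vector
Z = linkage(condensed, method="average")   # hierarchical clustering
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 groups

for name, label in zip(files, labels):
    print(name, label)
```

Passing `Z` to `dendrogram(Z, labels=files)` then draws the tree with file names on the axis instead of row indices. With this sample data, file1 and file4 end up in one group and file2 and file3 in the other.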


python

cluster-analysis

hierarchical-clustering