Spectral clustering of a graph in Python

I'd like to cluster a graph in Python using spectral clustering.

Spectral clustering is a more general technique which can be applied not only to graphs but also to images, or any sort of data; however, it is considered an exceptional graph clustering technique. Sadly, I can't find examples of spectral clustering of graphs in Python online.

I would love to have some idea of how to go about this. If anyone can help me figure it out, I could add the documentation to scikit-learn.

Notes:

Without much experience with spectral clustering and just going by the docs (skip to the end for the results!):

Code:

import numpy as np
import networkx as nx
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(1)

# Get your mentioned graph
G = nx.karate_club_graph()

# Get ground-truth: club-labels -> transform to 0/1 np-array
#     (possible overcomplicated networkx usage here)
gt_dict = nx.get_node_attributes(G, 'club')
gt = [gt_dict[i] for i in G.nodes()]
gt = np.array([0 if i == 'Mr. Hi' else 1 for i in gt])

# Get adjacency-matrix as numpy-array
adj_mat = nx.to_numpy_array(G)  # to_numpy_matrix was removed in newer networkx versions

print('ground truth')
print(gt)

# Cluster
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(adj_mat)

# Compare ground-truth and clustering-results
print('spectral clustering')
print(sc.labels_)
print('just for better-visualization: invert clusters (permutation)')
print(np.abs(sc.labels_ - 1))

# Calculate some clustering metrics
print(metrics.adjusted_rand_score(gt, sc.labels_))
print(metrics.adjusted_mutual_info_score(gt, sc.labels_))

Output:

ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[1 1 0 1 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
just for better-visualization: invert clusters (permutation)
[0 0 1 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
0.204094758281
0.271689477828

General idea:

Data and task introduction from here:

The nodes in the graph represent the 34 members in a college Karate club. (Zachary is a sociologist, and he was one of the members.) An edge between two nodes indicates that the two members spent significant time together outside normal club meetings. The dataset is interesting because while Zachary was collecting his data, there was a dispute in the Karate club, and it split into two factions: one led by “Mr. Hi”, and one led by “John A”. It turns out that using only the connectivity information (the edges), it is possible to recover the two factions.

Using sklearn and spectral clustering to tackle this:

If affinity is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.

This describes normalized graph cuts as:

Find two disjoint partitions A and B of the vertices V of a graph, so that A ∪ B = V and A ∩ B = ∅

Given a similarity measure w(i,j) between two vertices (e.g. identity when they are connected) a cut value (and its normalized version) is defined as: cut(A, B) = SUM u in A, v in B: w(u, v)

...

we seek the minimization of disassociation between the groups A and B and the maximization of the association within each group
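
For reference, the cut and its normalized version (following Shi & Malik's normalized cuts, which the scikit-learn docs build on) can be written out as below; this is my own transcription, not a verbatim quote from the linked page:

    cut(A, B)   = \sum_{u \in A,\, v \in B} w(u, v)
    assoc(A, V) = \sum_{u \in A,\, t \in V} w(u, t)
    Ncut(A, B)  = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}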

Sounds good. So we create the adjacency matrix (nx.to_numpy_array(G)) and set the parameter affinity to precomputed (because our adjacency matrix is our precomputed similarity measure).

Alternatively, using precomputed, a user-provided affinity matrix can be used.

Edit: While being unfamiliar with this, I looked for parameters to tune and found assign_labels:

The strategy to use to assign labels in the embedding space. There are two ways to assign labels after the laplacian embedding. k-means can be applied and is a popular choice. But it can also be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization.

So let's try the less sensitive approach:

sc = SpectralClustering(2, affinity='precomputed', n_init=100, assign_labels='discretize')

Output:

ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
just for better-visualization: invert clusters (permutation)
[1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
0.771725032425
0.722546051351

That fits the ground truth quite well!
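
As a side note on the "invert clusters (permutation)" step: the cluster ids returned by SpectralClustering are arbitrary, so instead of flipping them by hand one could align them to the ground truth automatically, e.g. with the Hungarian algorithm on the confusion matrix. A small sketch of that idea (align_labels is just a helper name made up here; the ARI/AMI scores above are permutation-invariant anyway):

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def align_labels(gt, pred):
    # Relabel pred so that it agrees with gt as much as possible
    cm = confusion_matrix(gt, pred)                  # rows: true, cols: predicted
    row_ind, col_ind = linear_sum_assignment(-cm)    # maximize the diagonal matches
    mapping = dict(zip(col_ind, row_ind))            # predicted id -> aligned id
    return np.array([mapping[p] for p in pred])

# e.g.: align_labels(gt, sc.labels_)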

Here is a dummy example, just to see what it does on a simple similarity matrix, inspired by sascha's answer.

Code

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(0)

# Similarity matrix with three blocks of three nodes each,
# connected to each other only by single weight-1 links
adj_mat = [[3,2,2,0,0,0,0,0,0],
           [2,3,2,0,0,0,0,0,0],
           [2,2,3,1,0,0,0,0,0],
           [0,0,1,3,3,3,0,0,0],
           [0,0,0,3,3,3,0,0,0],
           [0,0,0,3,3,3,1,0,0],
           [0,0,0,0,0,1,3,1,1],
           [0,0,0,0,0,0,1,3,1],
           [0,0,0,0,0,0,1,1,3]]

adj_mat = np.array(adj_mat)

sc = SpectralClustering(3, affinity='precomputed', n_init=100)
sc.fit(adj_mat)

print('spectral clustering')
print(sc.labels_)

Output

spectral clustering
[0 0 0 1 1 1 2 2 2]

Let's first cluster the graph G into K=2 clusters and then generalize to arbitrary K.

  • We can use the function linalg.algebraicconnectivity.fiedler_vector() from networkx to compute the Fiedler vector (the eigenvector corresponding to the second-smallest eigenvalue of the graph Laplacian matrix), assuming the graph is a connected undirected graph.

    We can then threshold the values of this eigenvector to compute the cluster index each node belongs to, as shown in the next code block:

    import networkx as nx
    import numpy as np
    
    A = np.zeros((11,11))
    A[0,1] = A[0,2] = A[0,3] = A[0,4] = 1
    A[5,6] = A[5,7] = A[5,8] = A[5,9] = A[5,10] = 1
    A[0,5] = 5
    
    G = nx.from_numpy_array(A)  # from_numpy_matrix was removed in newer networkx versions
    ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
    labels = [0 if v < 0 else 1 for v in ev] # using threshold 0
    labels
    # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    
    nx.draw(G, pos=nx.drawing.layout.spring_layout(G), 
               with_labels=True, node_color=labels)  
    

  • We can also obtain the same clustering with an eigenanalysis of the graph Laplacian, choosing the eigenvector corresponding to the second-smallest eigenvalue:

    L = nx.laplacian_matrix(G)
    e, v = np.linalg.eig(L.todense()) 
    idx = np.argsort(e)
    e = e[idx]
    v = v[:,idx]
    labels = [0 if x < 0 else 1 for x in v[:,1]] # using threshold 0
    labels
    # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    

    Plotting the graph again with the clusters labeled:

  • Using SpectralClustering from sklearn.cluster we can obtain exactly the same result:

    sc = SpectralClustering(2, affinity='precomputed', n_init=100)
    sc.fit(A)
    sc.labels_
    # [0 0 0 0 0 1 1 1 1 1 1]
    

  • We can generalize the above to K > 2 clusters as follows (using kmeans clustering to partition the Fiedler vector instead of thresholding it):

    The following code demonstrates how k-means clustering can be used to partition the Fiedler vector and obtain a 3-clustering of the graph defined by the adjacency matrix below:

    A = np.array([[3,2,2,0,0,0,0,0,0],
                  [2,3,2,0,0,0,0,0,0],
                  [2,2,3,1,0,0,0,0,0],
                  [0,0,1,3,3,3,0,0,0],
                  [0,0,0,3,3,3,0,0,0],
                  [0,0,0,3,3,3,1,0,0],
                  [0,0,0,0,0,1,3,1,1],
                  [0,0,0,0,0,0,1,3,1],
                  [0,0,0,0,0,0,1,1,3]])
    
    K = 3 # K clusters
    G = nx.from_numpy_array(A)  # from_numpy_matrix was removed in newer networkx versions
    ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
    from sklearn.cluster import KMeans
    kmeans = KMeans(n_clusters=K, random_state=0).fit(ev.reshape(-1,1))
    kmeans.labels_
    # array([2, 2, 2, 0, 0, 0, 1, 1, 1])
    

    Now plot the clustered graph, labeling the nodes with the clusters obtained above:
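
    A minimal plotting sketch in the spirit of the earlier nx.draw call (assuming matplotlib is available and reusing G and kmeans.labels_ from the block above):

    import matplotlib.pyplot as plt

    # Color each node by the cluster index found by k-means on the Fiedler vector
    nx.draw(G, pos=nx.spring_layout(G, seed=0),
            with_labels=True, node_color=kmeans.labels_)
    plt.show()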