Python 中的点云聚类分析 - 从二进制矩阵中识别聚类
Point-cloud cluster-analysis in Python - identifying clusters from binary matrix
例如,我有以下输入数据(我正在使用的点云更复杂)
Data = [1,1,1,1,1],[1,1,2,1,1],[2,2,2,1,1],[3,3,3,1,1],[4,4,4,1,1],[5,5,5,1,1],[50,50,50,1,1],[95,95,95,1,1],[96,96,96,1,1],[97,97,97,1,1],[98,98,98,1,1],[99,99,99,1,1],[2,2,3,1,1],[2,2,1,1,1],[2,2,4,1,1]
聚类算法给出了一个二元上三角矩阵(我们称之为连接矩阵)。 1 表示两点相连。例如。点 ID 0(第 0 行)连接到自身(第 0 列)和 1、2、3、12、13、14。但是第 4 点和第 5 点也可以通过 3、12、13 和 14 到达。
[ 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1.]
[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
我可以用 rowclustering(s) 识别每行的簇,其中 s 是上面的二进制矩阵。
def rowclustering(s):
r = 0
idx = []
while r < size(s,0):
row = []
for i in range(size(s,1)):
if s[r][i] == 1:
row = row + [i]
r = r + 1
idx = idx + [row]
return idx
返回的idx为:
idx = [[0, 1, 2, 3, 12, 13, 14], [1, 2, 3, 12, 13, 14], [2, 3, 4, 12, 13, 14], [3, 4, 5, 12, 13, 14], [4, 5, 12, 14], [5], [6], [7, 8, 9], [8, 9, 10], [9, 10, 11], [10, 11], [11], [12, 13, 14], [13, 14], [14]]
现在,显然集群少于 15 个,因为一些行通过公共 ID 连接(例如,查看 ID 4 和 5)。我想要的是:
result = [[0, 1, 2, 3, 4, 5, 12, 13, 14], [6], [7, 8, 9, 10, 11]]
我试图创建一个函数 (clustering(idx,f) ) 来执行此操作,其中 idx 是 rowclustering(s) 的结果,而 f 将是 idx 中的第一行,例如[0、1、2、3、12、13、14]。但是,此功能将无法正常完成。在建立所有连接(idx ID)后中断功能的正确代码是什么?
def clustering(idx,f):
for i in f:
f = f + idx[i]
f = list(set(f))
clustering(idx,f)
return
我要解决的问题是一种自我成长的过程。函数聚类应该调用自身,直到建立所有可能的点连接。这可以在 idx 或(可能更好)连接矩阵(矩阵缩减?)上完成。
非常感谢任何帮助!让我知道我是否应该澄清我的问题。谢谢。
您的问题可以看作是寻找连通分量。您可以使用 networkx 来获得解决方案,或者您可以自己实施 BFS(广度优先搜索)。
import networkx as nx
import numpy as np
x = """
[ 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1.]
[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
"""
mat = eval('[' + x.replace('.', '.,').replace(']', '],') + ']')
mat = np.array(mat)
G = nx.from_numpy_matrix(np.array(mat))
print(list(nx.connected_components(G)))
[{0, 1, 2, 3, 4, 5, 12, 13, 14}, {6}, {7, 8, 9, 10, 11}]
编辑:
实际上,这个问题让我想起了很久以前读过的东西。这实际上可以仅使用矩阵运算来计算。它非常漂亮。您的初始矩阵是邻接矩阵 (A),我们还需要指定一个度矩阵 (D),它保存对角线上每个节点的度数。我们可以使用这些来定义拉普拉斯矩阵 (L),然后使用一些谱图理论。 (耶!)
# Make the undirected version of the graph (no self loops)
A = (mat + mat.T) * (1 - np.eye(mat.shape[0]))
# Make the degree matrix
D = np.diag(A.sum(axis=1) + A.sum(axis=0)) / 2
# thats all we need to define the laplacian
L = D - A
# The number of zeros eigenvalues of the Laplacian is exactly the number of CCs
np.isclose(np.linalg.eigvals(L), 0).sum()
3
# The connected compoments themselves are identified by rows that have the same nullspace vector
u, s, vh = np.linalg.svd(L)
ns = vh[(s >= 1e-13).sum():].conj().T
array([[-0.32684842, -0.06239247, -0.0197079 ],
[-0.32684842, -0.06239247, -0.0197079 ],
[-0.32684842, -0.06239247, -0.0197079 ],
[-0.32684842, -0.06239247, -0.0197079 ],
[-0.32684842, -0.06239247, -0.0197079 ],
[-0.32684842, -0.06239247, -0.0197079 ],
[-0.19222441, 0.97663659, 0.09607676],
[-0.01778075, -0.04721352, 0.44435878],
[-0.01778075, -0.04721352, 0.44435878],
[-0.01778075, -0.04721352, 0.44435878],
[-0.01778075, -0.04721352, 0.44435878],
[-0.01778075, -0.04721352, 0.44435878],
[-0.32684842, -0.06239247, -0.0197079 ],
[-0.32684842, -0.06239247, -0.0197079 ],
[-0.32684842, -0.06239247, -0.0197079 ]])
现在,我们计算出了答案!解释起来有点奇怪。一点点处理就可以将其转换为您想要的表示形式。
# the following is a little numpy trick to find unique rows
# chopping off the last few decimal places to account for numerical errors
ns_ = np.ascontiguousarray(np.round(ns, 8)).view(np.dtype((np.void, ns.dtype.itemsize * ns.shape[1])))
ns_basis, row_to_cc_id = np.unique(ns_, return_inverse=True)
# Finally we can just use this to convert to the desired output format
groups = [[] for _ in range(len(ns_basis))]
for row, id in enumerate(row_to_cc_id):
groups[id].append(row)
print(groups)
[[0, 1, 2, 3, 4, 5, 12, 13, 14], [6], [7, 8, 9, 10, 11]]
例如,我有以下输入数据(我正在使用的点云更复杂)
Data = [1,1,1,1,1],[1,1,2,1,1],[2,2,2,1,1],[3,3,3,1,1],[4,4,4,1,1],[5,5,5,1,1],[50,50,50,1,1],[95,95,95,1,1],[96,96,96,1,1],[97,97,97,1,1],[98,98,98,1,1],[99,99,99,1,1],[2,2,3,1,1],[2,2,1,1,1],[2,2,4,1,1]
聚类算法给出了一个二元上三角矩阵(我们称之为连接矩阵)。 1 表示两点相连。例如。点 ID 0(第 0 行)连接到自身(第 0 列)和 1、2、3、12、13、14。但是第 4 点和第 5 点也可以通过 3、12、13 和 14 到达。
[ 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1.]
[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
我可以用 rowclustering(s) 识别每行的簇,其中 s 是上面的二进制矩阵。
def rowclustering(s):
r = 0
idx = []
while r < size(s,0):
row = []
for i in range(size(s,1)):
if s[r][i] == 1:
row = row + [i]
r = r + 1
idx = idx + [row]
return idx
返回的idx为:
idx = [[0, 1, 2, 3, 12, 13, 14], [1, 2, 3, 12, 13, 14], [2, 3, 4, 12, 13, 14], [3, 4, 5, 12, 13, 14], [4, 5, 12, 14], [5], [6], [7, 8, 9], [8, 9, 10], [9, 10, 11], [10, 11], [11], [12, 13, 14], [13, 14], [14]]
现在,显然集群少于 15 个,因为一些行通过公共 ID 连接(例如,查看 ID 4 和 5)。我想要的是:
result = [[0, 1, 2, 3, 4, 5, 12, 13, 14], [6], [7, 8, 9, 10, 11]]
我试图创建一个函数 (clustering(idx,f) ) 来执行此操作,其中 idx 是 rowclustering(s) 的结果,而 f 将是 idx 中的第一行,例如[0、1、2、3、12、13、14]。但是,此功能将无法正常完成。在建立所有连接(idx ID)后中断功能的正确代码是什么?
def clustering(idx,f):
for i in f:
f = f + idx[i]
f = list(set(f))
clustering(idx,f)
return
我要解决的问题是一种自我成长的过程。函数聚类应该调用自身,直到建立所有可能的点连接。这可以在 idx 或(可能更好)连接矩阵(矩阵缩减?)上完成。
非常感谢任何帮助!让我知道我是否应该澄清我的问题。谢谢。
您的问题可以看作是寻找连通分量。您可以使用 networkx 来获得解决方案,或者您可以自己实施 BFS(广度优先搜索)。
import networkx as nx
import numpy as np
x = """
[ 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1.]
[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
"""
mat = eval('[' + x.replace('.', '.,').replace(']', '],') + ']')
mat = np.array(mat)
G = nx.from_numpy_matrix(np.array(mat))
print(list(nx.connected_components(G)))
[{0, 1, 2, 3, 4, 5, 12, 13, 14}, {6}, {7, 8, 9, 10, 11}]
编辑:
实际上,这个问题让我想起了很久以前读过的东西。这实际上可以仅使用矩阵运算来计算。它非常漂亮。您的初始矩阵是邻接矩阵 (A),我们还需要指定一个度矩阵 (D),它保存对角线上每个节点的度数。我们可以使用这些来定义拉普拉斯矩阵 (L),然后使用一些谱图理论。 (耶!)
# Make the undirected version of the graph (no self loops)
A = (mat + mat.T) * (1 - np.eye(mat.shape[0]))
# Make the degree matrix
D = np.diag(A.sum(axis=1) + A.sum(axis=0)) / 2
# thats all we need to define the laplacian
L = D - A
# The number of zeros eigenvalues of the Laplacian is exactly the number of CCs
np.isclose(np.linalg.eigvals(L), 0).sum()
3
# The connected compoments themselves are identified by rows that have the same nullspace vector
u, s, vh = np.linalg.svd(L)
ns = vh[(s >= 1e-13).sum():].conj().T
array([[-0.32684842, -0.06239247, -0.0197079 ],
[-0.32684842, -0.06239247, -0.0197079 ],
[-0.32684842, -0.06239247, -0.0197079 ],
[-0.32684842, -0.06239247, -0.0197079 ],
[-0.32684842, -0.06239247, -0.0197079 ],
[-0.32684842, -0.06239247, -0.0197079 ],
[-0.19222441, 0.97663659, 0.09607676],
[-0.01778075, -0.04721352, 0.44435878],
[-0.01778075, -0.04721352, 0.44435878],
[-0.01778075, -0.04721352, 0.44435878],
[-0.01778075, -0.04721352, 0.44435878],
[-0.01778075, -0.04721352, 0.44435878],
[-0.32684842, -0.06239247, -0.0197079 ],
[-0.32684842, -0.06239247, -0.0197079 ],
[-0.32684842, -0.06239247, -0.0197079 ]])
现在,我们计算出了答案!解释起来有点奇怪。一点点处理就可以将其转换为您想要的表示形式。
# the following is a little numpy trick to find unique rows
# chopping off the last few decimal places to account for numerical errors
ns_ = np.ascontiguousarray(np.round(ns, 8)).view(np.dtype((np.void, ns.dtype.itemsize * ns.shape[1])))
ns_basis, row_to_cc_id = np.unique(ns_, return_inverse=True)
# Finally we can just use this to convert to the desired output format
groups = [[] for _ in range(len(ns_basis))]
for row, id in enumerate(row_to_cc_id):
groups[id].append(row)
print(groups)
[[0, 1, 2, 3, 4, 5, 12, 13, 14], [6], [7, 8, 9, 10, 11]]