List of lists of words clustering

Question

Suppose I have a list of lists of words, for example:

[['apple','banana'],
 ['apple','orange'],
 ['banana','orange'],
 ['rice','potatoes','orange'],
 ['potatoes','rice']]

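The real collection is much larger. I want to cluster the words so that words which usually occur together end up in the same cluster. In this case the clusters would be ['apple', 'banana', 'orange'] and ['rice', 'potatoes'].
What is the best way to achieve this kind of clustering?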

Answer 1

I think it is more natural to treat this as a graph problem.

For example, let apple be node 0 and banana be node 1; the first list then says there is an edge between nodes 0 and 1.

So first convert the labels to numbers:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Labels are assigned in sorted order: apple=0, banana=1, orange=2, potatoes=3, rice=4
le.fit(['apple', 'banana', 'orange', 'rice', 'potatoes'])

Now:

l = [['apple', 'banana'],
     ['apple', 'orange'],
     ['banana', 'orange'],
     # 'orange' was dropped from the next list because an edge joins exactly two nodes;
     # you could instead expand the triple into 3 pairs, or think of a different solution
     ['rice', 'potatoes'],
     ['potatoes', 'rice']]
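As a rough sketch of the "transform the triple to 3 pairs" option mentioned in the comment above: every list can be expanded into all of its unordered word pairs with itertools.combinations. The helper name to_pairs and the variable original are made up for this illustration and are not part of the answer; the rest of the walkthrough keeps using the simplified l.

```python
from itertools import combinations


def to_pairs(word_lists):
    """Expand each word list into all unordered pairs of its distinct words."""
    pairs = []
    for words in word_lists:
        # sorted(set(...)) drops repeated words and makes the pair order deterministic
        pairs.extend(combinations(sorted(set(words)), 2))
    return pairs


# The question's data, including the three-word list (hypothetical variable name)
original = [['apple', 'banana'],
            ['apple', 'orange'],
            ['banana', 'orange'],
            ['rice', 'potatoes', 'orange'],
            ['potatoes', 'rice']]

pairs = to_pairs(original)
for pair in pairs:
    print(pair)
```

Note that keeping the orange pairs adds edges between orange and both rice and potatoes, so a plain connected-components analysis on these pairs would link the two groups into one.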

Convert the labels to numbers:

edges=[le.transform(x) for x in l]

>>> edges

[array([0, 1], dtype=int64),
array([0, 2], dtype=int64),
array([1, 2], dtype=int64),
array([4, 3], dtype=int64),
array([3, 4], dtype=int64)]

Now build the graph and add the edges:

import networkx as nx #graphs package
G=nx.Graph() #create the graph and add edges
for e in edges:
    G.add_edge(e[0],e[1])

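Now you can use the connected_components function to find the groups of connected vertices. Older NetworkX versions offered connected_component_subgraphs for this, but it was removed in NetworkX 2.4.

components = nx.connected_components(G)  # yields one set of nodes per connected component
# convert the NumPy integers to plain ints so the printed result is easy to read
comp_dict = {idx: sorted(int(n) for n in comp) for idx, comp in enumerate(components)}
print(comp_dict)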

Output:

{0: [0, 1, 2], 1: [3, 4]}

Or, to get the words back:

print([le.inverse_transform(v) for v in comp_dict.values()])

Output:

[array(['apple', 'banana', 'orange']), array(['potatoes', 'rice'])]

Those are your 2 clusters.
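If you would rather avoid the extra dependencies, the same connected-components idea can be sketched with a small union-find in plain Python. This is only an illustration, not the code from the answer: the function name cluster_words is made up here, and it works on the words directly, with lists of any length.

```python
def cluster_words(word_lists):
    """Group words that are linked, directly or indirectly, by sharing a list."""
    parent = {}

    def find(word):
        # Follow parent links up to the representative, shortening the path on the way
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for words in word_lists:
        for word in words:
            # Merge every word of a list into the set of that list's first word
            root_a, root_b = find(words[0]), find(word)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for word in parent:
        clusters.setdefault(find(word), []).append(word)
    return sorted(sorted(group) for group in clusters.values())


simplified = [['apple', 'banana'], ['apple', 'orange'], ['banana', 'orange'],
              ['rice', 'potatoes'], ['potatoes', 'rice']]
print(cluster_words(simplified))
# [['apple', 'banana', 'orange'], ['potatoes', 'rice']]

with_triple = simplified[:3] + [['rice', 'potatoes', 'orange'], ['potatoes', 'rice']]
print(cluster_words(with_triple))
# [['apple', 'banana', 'orange', 'potatoes', 'rice']]
```

The second call shows why orange was removed in the walkthrough: with the original three-word list, a single shared word is enough to merge everything into one component.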

Answer 2

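So, after a lot of Googling, I found that I cannot really use clustering techniques here, because I lack feature variables on which to cluster the words. If I build a table recording how often each word occurs together with every other word (effectively the Cartesian product), that table is in fact an adjacency matrix, and clustering does not work well on it.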
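To make the co-occurrence table concrete, here is a minimal sketch that builds it for the example data from the question. The function name cooccurrence_matrix and the variable word_lists are assumptions for this illustration; the result is a symmetric matrix of counts, which is exactly a weighted adjacency matrix.

```python
from itertools import combinations


def cooccurrence_matrix(word_lists):
    """Return (vocab, matrix); matrix[i][j] counts the lists containing both words."""
    vocab = sorted({word for words in word_lists for word in words})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for words in word_lists:
        for a, b in combinations(sorted(set(words)), 2):
            i, j = index[a], index[b]
            matrix[i][j] += 1
            matrix[j][i] += 1  # keep the matrix symmetric
    return vocab, matrix


word_lists = [['apple', 'banana'],
              ['apple', 'orange'],
              ['banana', 'orange'],
              ['rice', 'potatoes', 'orange'],
              ['potatoes', 'rice']]

vocab, matrix = cooccurrence_matrix(word_lists)
for word, row in zip(vocab, matrix):
    print(f"{word:10}", row)
```

Each row only says which other words a word was seen with and how often, which is graph structure rather than an independent feature vector.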

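So the solution I was looking for is graph community detection. I used the igraph library (or rather its Python wrapper, python-igraph) to find the clusters, and it runs very fast.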

More information:

Similar question: https://stats.stackexchange.com/questions/142297/finding-natural-groups-clusters-in-an-undirected-graph-over-several-undirect
Community detection in graphs: https://arxiv.org/pdf/0906.0612.pdf
Basic description of the various algorithms: What are the differences between community detection algorithms in igraph?

Answer 3

It would make more sense to look for frequent itemsets.

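If you cluster word sets this short, there are usually only a few levels at which things can be connected: nothing in common, one element in common, two elements in common. That is too coarse for clustering. You will end up with everything connected or nothing connected, and the results are likely to be highly sensitive to changes in the data and to its ordering.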

So abandon the paradigm of partitioning the data and look for frequent combinations instead.
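As a minimal sketch of what looking for frequent combinations could mean on the question's data: count every word combination of a given size and keep those that appear in at least min_support lists. The names frequent_itemsets, size, min_support and baskets are assumptions for this illustration, and the brute-force counting only suits small inputs; dedicated algorithms such as Apriori or FP-Growth prune the search on larger data.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(word_lists, size=2, min_support=2):
    """Keep the `size`-word combinations that occur in at least `min_support` lists."""
    counts = Counter()
    for words in word_lists:
        # Each list contributes every combination of its distinct words once
        counts.update(combinations(sorted(set(words)), size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


baskets = [['apple', 'banana'],
           ['apple', 'orange'],
           ['banana', 'orange'],
           ['rice', 'potatoes', 'orange'],
           ['potatoes', 'rice']]

frequent = frequent_itemsets(baskets)
print(frequent)
# {('potatoes', 'rice'): 2}
```

On this tiny example only rice and potatoes co-occur more than once, which illustrates the point above: the fruit pairs each appear a single time, so there is little evidence for treating them as a group.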

Tags: python, information-retrieval, cluster-analysis, machine-learning