Pandas dataframe groupby text value that occurs in two columns

Question

My dataframe looks like this:

     v1           v2        distance
0   be          belong      0.666667
4   increase    decrease    0.666667
9   analyze     assay       0.666667
11  bespeak     circulate   0.769231
21  induce      generate    0.800000
24  decrease    delay       0.750000
26  cause       trip        0.666667
27  isolate     distinguish 0.750000
28  give        infect      0.666667
29  result      prove       0.800000
31  describe    explain     0.714286
33  report      circulate   0.666667
36  affect      expose      0.666667
40  explain     intercede   0.705882
41  suppress    restrict    0.833333

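Here v1 and v2 are verbs and distance is their similarity score. I want to create clusters of words based on which similar words occur together in the dataframe.

For example, the word circulate appears to be similar to both bespeak and report, so I want these 3 words in one group. Groupby does not help, because the values are strings spread across two columns. Can anyone help?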

Answer 1

This looks like a graph problem.

You can try using networkx:

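import networkx as nx

# Treat each (v1, v2) row as an edge between two words
G = nx.from_pandas_edgelist(df, 'v1', 'v2')

# connected_components returns a generator of sets; list() materializes it
clusters = list(nx.connected_components(G))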

Output:

[{'be', 'belong'}, {'delay', 'increase', 'decrease'}, {'analyze', 'assay'},
 {'report', 'bespeak', 'circulate'}, {'induce', 'generate'}, {'trip', 'cause'},
 {'distinguish', 'isolate'}, {'infect', 'give'}, {'prove', 'result'},
 {'intercede', 'describe', 'explain'}, {'affect', 'expose'}, {'restrict', 'suppress'}]

As a graph:

A small function to plot the graph in jupyter:

def nxplot(G):
    # to_agraph needs the optional pygraphviz package
    from networkx.drawing.nx_agraph import to_agraph
    from IPython.display import Image
    A = to_agraph(G)
    A.layout('dot')  # hierarchical layout computed by Graphviz
    A.draw('/tmp/graph.png')
    return Image(filename='/tmp/graph.png')
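
If adding networkx is not an option, the same kind of grouping can be sketched with a small union-find in plain Python. This is only an illustration, not how networkx does it internally: the helper name connected_groups and the labels mapping are made up for this example, and the pairs are a handful of rows copied from the question.

```python
from collections import defaultdict

def connected_groups(pairs):
    """Group words that are linked, directly or indirectly, by any pair."""
    parent = {}

    def find(word):
        # Follow parent links up to the group's representative,
        # shortening the path along the way
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for word in list(parent):
        groups[find(word)].add(word)
    return list(groups.values())

# A few (v1, v2) pairs copied from the question's dataframe
pairs = [
    ("increase", "decrease"),
    ("bespeak", "circulate"),
    ("decrease", "delay"),
    ("report", "circulate"),
    ("describe", "explain"),
    ("explain", "intercede"),
]
groups = connected_groups(pairs)

# Hypothetical helper mapping: word -> group number
labels = {word: i for i, group in enumerate(groups) for word in group}
```

With the dataframe from the question, the pairs could come from zip(df['v1'], df['v2']), and df['v1'].map(labels) should then give a column that groupby can work with.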

Answer 2

The following line selects only the rows that contain the string target_string:

rows = df[(df[['v1', 'v2']] == target_string).any(axis=1)]

Concatenate the two columns and find the unique elements:

cluster = pd.concat([rows['v1'], rows['v2']]).unique()

If you want to find the clusters for all words, repeat this for every unique element. An inefficient example:

clusters = []
# Loop over every word from both columns, since some words appear only in v2
for target_string in pd.concat([df['v1'], df['v2']]).unique():
    rows = df[(df[['v1', 'v2']] == target_string).any(axis=1)]
    clusters.append(pd.concat([rows['v1'], rows['v2']]).unique())
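
The loop above rescans the whole dataframe once per word. A single pass that builds a word-to-neighbours mapping gives the same kind of result more cheaply. The sketch below uses plain Python pairs as a stand-in for the v1 and v2 columns, and the function name direct_neighbours is made up for this example. Note that, like the row-selection approach, it only collects words that share a row with the target; it does not chain through intermediate words the way connected components in Answer 1 does.

```python
from collections import defaultdict

def direct_neighbours(pairs):
    """Map each word to the set holding itself and every word paired with it."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].update((a, b))
        neighbours[b].update((a, b))
    return dict(neighbours)

# A few (v1, v2) pairs copied from the question's dataframe
pairs = [
    ("increase", "decrease"),
    ("bespeak", "circulate"),
    ("decrease", "delay"),
    ("report", "circulate"),
]
clusters = direct_neighbours(pairs)
print(clusters["circulate"])  # the words sharing a row with "circulate"
```

Here "decrease" maps to increase, decrease and delay, while "increase" maps only to increase and decrease, which shows the difference from a full connected-components grouping.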
