How to group by one column or another in pandas

Question

I have a table like:

    col1    col2
0   1       a
1   2       b
2   2       c
3   3       c
4   4       d

I want to group rows together if they have a matching value in col1 or in col2. That is, I'd like something like this:

> (
    df
    .groupby(set('col1', 'col2'))  # Made-up syntax
    .ngroup())
0  0
1  1
2  1
3  1
4  2

Is there a way to do this with pandas?

Answer 1

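This is not easy to achieve with pandas alone. The difficulty is that matches chain: two groups from one column that share no value can still end up joined when a group from the other column contains a row from each of them.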
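To make the chaining concrete, here is a small plain-Python check on the question's table. The helper `direct_match` is a hypothetical name introduced only for this illustration; it is not part of pandas.

```python
# Rows of the question's table as (col1, col2) pairs.
rows = [(1, "a"), (2, "b"), (2, "c"), (3, "c"), (4, "d")]

def direct_match(i, j):
    """True if rows i and j share a value in col1 or in col2."""
    return rows[i][0] == rows[j][0] or rows[i][1] == rows[j][1]

# Rows 1 and 3 have nothing in common directly...
print(direct_match(1, 3))  # False
# ...but each of them matches row 2, so all three belong together.
print(direct_match(1, 2), direct_match(2, 3))  # True True
```

A single groupby key cannot express this, because membership depends on the other rows and not only on the row's own values.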

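You can use graph theory to solve this. Build a graph whose edges link each row's group in one column to its group in the other (this extends to more than two groupings) and find its connected components. A Python library for this is networkx: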

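import networkx as nx
import pandas as pd

# the table from the question
df = pd.DataFrame({'col1': [1, 2, 2, 3, 4], 'col2': list('abccd')})

# group number of each row per column; the 'a' prefix keeps the
# col2 labels (strings) distinct from the col1 labels (integers)
g1 = df.groupby('col1').ngroup()
g2 = 'a' + df.groupby('col2').ngroup().astype(str)

# make graph and get connected components to form a mapping dictionary
G = nx.from_edgelist(zip(g1, g2))
d = {k: v for v, s in enumerate(nx.connected_components(G)) for k in s}

# find common group
group = g1.map(d)

df.groupby(group).ngroup()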

Output:

0    0
1    1
2    1
3    1
4    2
dtype: int64
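If adding networkx as a dependency is not an option, the same connected components can be found with a small union-find structure. The following is only a sketch under that assumption, not part of the answer's method: `group_by_any` is a hypothetical helper name, and it takes each column as a plain sequence of values.

```python
def group_by_any(*columns):
    """Label rows so that rows sharing a value in any column get the same label.

    Each argument is one column given as a sequence of values (all the same length).
    """
    n = len(columns[0])
    parent = list(range(n))  # union-find forest over row positions

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for values in columns:
        first_seen = {}  # value -> first row position holding it
        for i, v in enumerate(values):
            if v in first_seen:
                # Merge this row's component with the earlier row's component.
                ri, rj = find(first_seen[v]), find(i)
                parent[max(ri, rj)] = min(ri, rj)
            else:
                first_seen[v] = i

    # Number the components in order of first appearance.
    labels, ids = [], {}
    for i in range(n):
        labels.append(ids.setdefault(find(i), len(ids)))
    return labels

col1 = [1, 2, 2, 3, 4]
col2 = ["a", "b", "c", "c", "d"]
print(group_by_any(col1, col2))  # [0, 1, 1, 1, 2]
```

With a DataFrame, the columns could be passed as `df['col1'].tolist()` and `df['col2'].tolist()`, and the returned list assigned back as a new column.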
