Associative groups in Python

Question

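After preprocessing, I have a final dataframe with the columns 'timestamp', 'group', 'person1', and 'person2'. I am trying to work out how to code my requirement, or whether it is possible at all in Python. What I want to extract is the groups within each group. For example, in group G0, A meets B, B meets C, and A meets D, which means A, B, C, and D together form one subgroup. Each group can contain several subgroups (see group G1, for example). How can I achieve this? What logic or code could I apply to extract them? I have searched a lot without success.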

The original post included images of a sample dataframe and the expected output.

Sample data:

import pandas as pd

df = pd.DataFrame(
    {
        "timestamp": ['25-06-2020 09:29','25-06-2020 09:29','25-06-2020 09:31','25-06-2020 09:32','25-06-2020 09:33','25-06-2020 09:33','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 12:29','25-06-2020 12:29','25-06-2020 12:30','25-06-2020 12:30'],
        "group": ['G0','G0','G0','G0','G0','G0','G1','G1','G1','G1','G1','G2','G2','G2'],
        "person1": ['A','A','B','A','X','Z','A','B','L','X','Y','L','N','O'],
        "person2": ['B','B','C','D','Y','N','B','C','M','Y','Z','M','O','P']
    }
)

Answer 1

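This is certainly an interesting problem, especially given the current pandemic. It sounds like you need graph theory to help you. Python can do this through dictionaries and custom classes, as this tutorial describes.

Also, this out-of-date Python documentation may be of some help. You would need to adapt its find_all_graphs() function to your requirements.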
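To make the dictionary idea concrete, here is a minimal sketch using only the standard library. It is not the code from the linked tutorial or documentation: the function name `connected_groups` and the breadth-first search are illustrative choices. Each meeting is treated as an undirected edge, and everyone reachable from a person ends up in the same subgroup.

```python
from collections import defaultdict, deque


def connected_groups(pairs):
    """Return the connected components of an undirected graph given as (a, b) pairs."""
    # Adjacency dictionary: person -> set of people they met
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everyone reachable from `start`
        queue = deque([start])
        seen.add(start)
        members = []
        while queue:
            node = queue.popleft()
            members.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(members))
    return components


# Meetings recorded in group G0 of the sample data
g0_meetings = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "D"), ("X", "Y"), ("Z", "N")]
print(connected_groups(g0_meetings))
# [['A', 'B', 'C', 'D'], ['X', 'Y'], ['N', 'Z']]
```

To cover the whole dataframe, you would call this once per value of 'group', passing that group's (person1, person2) pairs.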

Answer 2

You can use the networkx library, treating the meetings as a graph and finding its connected components:

import networkx as nx
import pandas as pd

df = pd.DataFrame(
    {
        "timestamp": ['25-06-2020 09:29','25-06-2020 09:29','25-06-2020 09:31','25-06-2020 09:32','25-06-2020 09:33','25-06-2020 09:33','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 12:29','25-06-2020 12:29','25-06-2020 12:30','25-06-2020 12:30'],
        "group": ['G0','G0','G0','G0','G0','G0','G1','G1','G1','G1','G1','G2','G2','G2'],
        "person1": ['A','A','B','A','X','Z','A','B','L','X','Y','L','N','O'],
        "person2": ['B','B','C','D','Y','N','B','C','M','Y','Z','M','O','P']
    }
)

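def f(x):
    # Build an undirected graph from the meetings within one group
    G = nx.from_pandas_edgelist(x, 'person1', 'person2')
    # Label each row with the members of its connected component
    # (the component is a set, so the letter order in the label is arbitrary)
    l = x.apply(lambda n: ''.join(nx.node_connected_component(G, n['person1'])), axis=1)
    return l

# to_numpy() drops the index, so this relies on df already being sorted by 'group'
df['subgroup'] = df.groupby('group').apply(f).to_numpy()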
df

Output:

           timestamp group person1 person2 subgroup
0   25-06-2020 09:29    G0       A       B     DACB
1   25-06-2020 09:29    G0       A       B     DACB
2   25-06-2020 09:31    G0       B       C     DACB
3   25-06-2020 09:32    G0       A       D     DACB
4   25-06-2020 09:33    G0       X       Y       YX
5   25-06-2020 09:33    G0       Z       N       NZ
6   25-06-2020 11:17    G1       A       B      ACB
7   25-06-2020 11:17    G1       B       C      ACB
8   25-06-2020 11:17    G1       L       M       ML
9   25-06-2020 11:17    G1       X       Y      ZYX
10  25-06-2020 12:29    G1       Y       Z      ZYX
11  25-06-2020 12:29    G2       L       M       ML
12  25-06-2020 12:30    G2       N       O      ONP
13  25-06-2020 12:30    G2       O       P      ONP

Group by subgroup:

# The timestamps are day-first, so state the format explicitly
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%d-%m-%Y %H:%M')
df.groupby('subgroup')['timestamp'].agg(['min', 'max'])

Output:

                         min                 max
subgroup                                        
ACB      2020-06-25 11:17:00 2020-06-25 11:17:00
DACB     2020-06-25 09:29:00 2020-06-25 09:32:00
ML       2020-06-25 11:17:00 2020-06-25 12:29:00
NZ       2020-06-25 09:33:00 2020-06-25 09:33:00
ONP      2020-06-25 12:30:00 2020-06-25 12:30:00
YX       2020-06-25 09:33:00 2020-06-25 09:33:00
ZYX      2020-06-25 11:17:00 2020-06-25 12:29:00
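Two details of the output above are worth noting. The labels are joined from sets, so the letter order (for example DACB) can differ between runs. Also, grouping by 'subgroup' alone merges the L-M pair from G1 with the one from G2. The following is one possible variant, not part of the original answer, that sorts the labels and groups by both 'group' and 'subgroup'. The names `member_label`, `pieces`, and `summary` are illustrative.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame(
    {
        "timestamp": ['25-06-2020 09:29','25-06-2020 09:29','25-06-2020 09:31','25-06-2020 09:32','25-06-2020 09:33','25-06-2020 09:33','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 12:29','25-06-2020 12:29','25-06-2020 12:30','25-06-2020 12:30'],
        "group": ['G0','G0','G0','G0','G0','G0','G1','G1','G1','G1','G1','G2','G2','G2'],
        "person1": ['A','A','B','A','X','Z','A','B','L','X','Y','L','N','O'],
        "person2": ['B','B','C','D','Y','N','B','C','M','Y','Z','M','O','P']
    }
)

pieces = []
for _, rows in df.groupby("group"):
    # One undirected graph per group; repeated meetings collapse into one edge
    G = nx.from_pandas_edgelist(rows, "person1", "person2")
    # Map every person to a sorted, reproducible label for their component
    member_label = {}
    for component in nx.connected_components(G):
        label = "".join(sorted(component))
        for person in component:
            member_label[person] = label
    pieces.append(rows["person1"].map(member_label))

# Assigning a Series aligns on the index, so row order does not matter here
df["subgroup"] = pd.concat(pieces)

df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d-%m-%Y %H:%M")
# Grouping by both keys keeps G1's LM separate from G2's LM
summary = df.groupby(["group", "subgroup"])["timestamp"].agg(["min", "max"])
print(summary)
```

Because the labels are assigned by index rather than through to_numpy(), this version does not depend on the dataframe being sorted by 'group'.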
