Associative groups in Python
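After preprocessing, I ended up with a final dataframe containing the columns 'timestamp', 'group', 'person1', and 'person2'. I am trying to figure out how to code my requirement, or whether it is even possible in Python.
What I want to extract is the groups within each group. For example, in group G0, A meets B, B meets C, and A meets D, which means A, B, C, and D form one group within the group. There can be multiple such groups within each group (group G1, for example). How can I do this? What logic or code can I apply to extract it? I have searched a lot without success.
An image of the sample dataframe and the expected output:
Sample data: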
df = pd.DataFrame(
{
"timestamp": ['25-06-2020 09:29','25-06-2020 09:29','25-06-2020 09:31','25-06-2020 09:32','25-06-2020 09:33','25-06-2020 09:33','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 12:29','25-06-2020 12:29','25-06-2020 12:30','25-06-2020 12:30'],
"group": ['G0','G0','G0','G0','G0','G0','G1','G1','G1','G1','G1','G2','G2','G2'],
"person1": ['A','A','B','A','X','Z','A','B','L','X','Y','L','N','O'],
"person2": ['B','B','C','D','Y','N','B','C','M','Y','Z','M','O','P']
}
)
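This is certainly interesting, especially given the current pandemic. It sounds like you need graph theory to help you. Python can do this through dictionaries and custom classes, as this tutorial describes.
Also, this out-of-date Python documentation may be of some help. You would need to adapt its find_all_graphs() function to your requirements.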
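To make the dictionary idea concrete, here is a minimal sketch (not taken from the linked tutorial) that stores the meetings of one group in a dict of sets and walks it breadth-first to collect each connected set of people. The function name connected_components and the variable g0_edges are made up for this illustration; the edges are the G0 rows of the sample data.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return a list of sorted member lists, one per connected component."""
    # Undirected adjacency: each person maps to the set of people they met
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Breadth-first walk from an unvisited person
        seen.add(start)
        queue = deque([start])
        members = []
        while queue:
            node = queue.popleft()
            members.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(members))
    return components

# Meetings in group G0 from the sample data (person1, person2)
g0_edges = [('A', 'B'), ('A', 'B'), ('B', 'C'), ('A', 'D'), ('X', 'Y'), ('Z', 'N')]
print(connected_components(g0_edges))
# [['A', 'B', 'C', 'D'], ['N', 'Z'], ['X', 'Y']]
```

Running the same function once per value of 'group' would give the groups within each group.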
You can use the networkx library, along with graph theory and connected components:
import networkx as nx
import pandas as pd
df = pd.DataFrame(
{
"timestamp": ['25-06-2020 09:29','25-06-2020 09:29','25-06-2020 09:31','25-06-2020 09:32','25-06-2020 09:33','25-06-2020 09:33','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 12:29','25-06-2020 12:29','25-06-2020 12:30','25-06-2020 12:30'],
"group": ['G0','G0','G0','G0','G0','G0','G1','G1','G1','G1','G1','G2','G2','G2'],
"person1": ['A','A','B','A','X','Z','A','B','L','X','Y','L','N','O'],
"person2": ['B','B','C','D','Y','N','B','C','M','Y','Z','M','O','P']
}
)
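def f(x):
    # Build an undirected graph of the meetings within one group
    G = nx.from_pandas_edgelist(x, 'person1', 'person2')
    # Label each row with the members of person1's connected component
    # (a set, so the order of the letters in the label is arbitrary)
    l = x.apply(lambda n: ''.join(nx.node_connected_component(G, n['person1'])), axis=1)
    return l
# Relies on the rows of df already being ordered by group
df['subgroup'] = df.groupby('group').apply(f).to_numpy()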
df
Output:
           timestamp  group  person1  person2  subgroup
0   25-06-2020 09:29     G0        A        B      DACB
1   25-06-2020 09:29     G0        A        B      DACB
2   25-06-2020 09:31     G0        B        C      DACB
3   25-06-2020 09:32     G0        A        D      DACB
4   25-06-2020 09:33     G0        X        Y        YX
5   25-06-2020 09:33     G0        Z        N        NZ
6   25-06-2020 11:17     G1        A        B       ACB
7   25-06-2020 11:17     G1        B        C       ACB
8   25-06-2020 11:17     G1        L        M        ML
9   25-06-2020 11:17     G1        X        Y       ZYX
10  25-06-2020 12:29     G1        Y        Z       ZYX
11  25-06-2020 12:29     G2        L        M        ML
12  25-06-2020 12:30     G2        N        O       ONP
13  25-06-2020 12:30     G2        O        P       ONP
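The labels above come from joining a Python set, so the letter order (DACB, YX, ...) can differ between runs, and the assignment assumes the rows are already ordered by group. As a hedged variant, the sketch below sorts each component before joining and looks labels up by (group, person1), so it should not depend on row order. The helper name label_subgroups and the small interleaved frame are assumptions made for this illustration only.

```python
import networkx as nx
import pandas as pd

def label_subgroups(df):
    """Return a Series of sorted component labels aligned to df's index."""
    labels = {}
    for group, rows in df.groupby('group'):
        # One undirected graph per group
        G = nx.from_pandas_edgelist(rows, 'person1', 'person2')
        for component in nx.connected_components(G):
            # Sorting makes the label deterministic, e.g. 'ABC' rather than 'CAB'
            label = ''.join(sorted(component))
            for person in component:
                labels[(group, person)] = label
    keys = zip(df['group'], df['person1'])
    return pd.Series([labels[k] for k in keys], index=df.index)

# Hypothetical frame with the groups deliberately interleaved
df = pd.DataFrame({
    'group':   ['G1', 'G0', 'G0', 'G1', 'G0'],
    'person1': ['L', 'A', 'B', 'A', 'X'],
    'person2': ['M', 'B', 'C', 'B', 'Y'],
})
df['subgroup'] = label_subgroups(df)
print(df)
```

With this small frame the subgroup column comes out as LM, ABC, ABC, AB, XY, matching each row's own group even though the rows are not sorted.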
Group by subgroup:
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%d-%m-%Y %H:%M')
df.groupby('subgroup')['timestamp'].agg(['min', 'max'])
Output:
                          min                  max
subgroup
ACB       2020-06-25 11:17:00  2020-06-25 11:17:00
DACB      2020-06-25 09:29:00  2020-06-25 09:32:00
ML        2020-06-25 11:17:00  2020-06-25 12:29:00
NZ        2020-06-25 09:33:00  2020-06-25 09:33:00
ONP       2020-06-25 12:30:00  2020-06-25 12:30:00
YX        2020-06-25 09:33:00  2020-06-25 09:33:00
ZYX       2020-06-25 11:17:00  2020-06-25 12:29:00
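Note that in the output above the ML row spans 11:17 to 12:29 because L and M meet in both G1 and G2, and grouping by 'subgroup' alone merges the two. If the subgroups are meant to stay separate per group, one option is to group by both columns. The sketch below uses three hand-picked rows with the labels from the output above; the variable name spans is an assumption for this illustration.

```python
import pandas as pd

# Three rows taken from the labelled output: ML occurs in G1 and in G2
df = pd.DataFrame({
    'timestamp': ['25-06-2020 11:17', '25-06-2020 12:29', '25-06-2020 12:30'],
    'group': ['G1', 'G2', 'G2'],
    'subgroup': ['ML', 'ML', 'ONP'],
})
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%d-%m-%Y %H:%M')

# Grouping on both keys keeps G1's ML apart from G2's ML
spans = df.groupby(['group', 'subgroup'])['timestamp'].agg(['min', 'max'])
print(spans)
```

Here the result has one row each for (G1, ML), (G2, ML), and (G2, ONP) instead of a single combined ML row.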