Extract co-occurrence data from dataframe
I have something like this:
fromJobtitle toJobtitle size
0 CEO CEO 65
1 CEO Vice President 23
2 CEO Employee 56
3 Vice President CEO 112
4 Employee CEO 20
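I want to count the co-occurrences so that both directions of a pair are combined (showing only how many elements there are between the two titles).
Example output: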
0 CEO Vice President 135
1 CEO Employee 76
2 CEO CEO 65
import pandas as pd
df = pd.DataFrame({
'fromJobtitle': ['CEO', 'CEO', 'CEO', 'Vice President', 'Employee'],
'toJobtitle': ['CEO', 'Vice President', 'Employee', 'CEO', 'CEO'],
'size': [65, 23, 56, 112, 20]
})
df['combination'] = df.apply(lambda row: tuple(sorted([
row['fromJobtitle'],
row['toJobtitle']
])), axis=1)
Then:
df = df.groupby('combination')['size'].sum().reset_index()
Result:
combination size
0 (CEO, CEO) 65
1 (CEO, Employee) 76
2 (CEO, Vice President) 135
Finally:
df['from'] = df.apply(lambda row: row['combination'][0], axis=1)
df['to'] = df.apply(lambda row: row['combination'][1], axis=1)
df = df.drop('combination', axis=1)
df.head()
Result:
size from to
0 65 CEO CEO
1 76 CEO Employee
2 135 CEO Vice President
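The same idea can be sketched without a row-wise apply. This is only an illustrative variant, not the method from the answer above: it swaps in np.sort to order the two titles within each row, and the names pairs and result are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'fromJobtitle': ['CEO', 'CEO', 'CEO', 'Vice President', 'Employee'],
    'toJobtitle': ['CEO', 'Vice President', 'Employee', 'CEO', 'CEO'],
    'size': [65, 23, 56, 112, 20]
})

# Sort the two titles inside each row so that (A, B) and (B, A)
# end up as the same ordered pair.
pairs = np.sort(df[['fromJobtitle', 'toJobtitle']].to_numpy(dtype=object), axis=1)

# Rebuild a frame from the ordered pairs and sum 'size' per pair.
result = (
    pd.DataFrame({'from': pairs[:, 0], 'to': pairs[:, 1], 'size': df['size'].to_numpy()})
    .groupby(['from', 'to'], as_index=False)['size']
    .sum()
)
print(result)
```

Grouping on two plain columns also avoids the tuple column and the unpacking step afterwards.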
Try:
df.groupby(lambda x: tuple(sorted(df.loc[x, ['fromJobtitle', 'toJobtitle']])))[['size']].sum()
The result is as follows:
size
(CEO, CEO) 65
(CEO, Employee) 76
(CEO, Vice President) 135
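Here is a different solution:
First, create a column that combines the two names in alphabetical order:
import numpy as np
df['titles'] = np.where(df['fromJobtitle'] < df['toJobtitle'], df['fromJobtitle'] + "|" + df['toJobtitle'], df['toJobtitle'] + "|" + df['fromJobtitle'])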
0 CEO|CEO
1 CEO|Vice President
2 CEO|Employee
3 CEO|Vice President
4 CEO|Employee
Name: titles, dtype: object
Then group by that combined name and sum the sizes:
df_groups = df.groupby('titles')['size'].sum().reset_index()
df_groups
titles size
CEO|CEO 65
CEO|Employee 76
CEO|Vice President 135
Then split the combined name back into its separate parts:
df_groups[['fromJobTitle', 'toJobTitle']] = df_groups['titles'].str.split('|', expand=True)
df_groups
size fromJobTitle toJobTitle
65 CEO CEO
76 CEO Employee
135 CEO Vice President
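To get the rows in the order shown in the question (largest total first), the steps above could be chained roughly as follows. This is a sketch under two assumptions: that "|" never appears inside a job title, and that descending order by size is what the example output intends. The names key and totals are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'fromJobtitle': ['CEO', 'CEO', 'CEO', 'Vice President', 'Employee'],
    'toJobtitle': ['CEO', 'Vice President', 'Employee', 'CEO', 'CEO'],
    'size': [65, 23, 56, 112, 20]
})

# Order-independent key; assumes "|" does not occur in any title.
key = np.where(df['fromJobtitle'] < df['toJobtitle'],
               df['fromJobtitle'] + '|' + df['toJobtitle'],
               df['toJobtitle'] + '|' + df['fromJobtitle'])

# Sum per key, then sort so the largest total comes first.
totals = (
    df.assign(titles=key)
      .groupby('titles', as_index=False)['size'].sum()
      .sort_values('size', ascending=False)
      .reset_index(drop=True)
)

# Split the key back into two columns and drop the helper column.
totals[['fromJobtitle', 'toJobtitle']] = totals['titles'].str.split('|', expand=True)
totals = totals[['fromJobtitle', 'toJobtitle', 'size']]
print(totals)
```

A quick sanity check is that the combined sizes still add up to the original total, since every input row lands in exactly one pair.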