Extract co-occurrence data from dataframe

I have something like this:

     fromJobtitle      toJobtitle  size
0             CEO             CEO    65
1             CEO  Vice President    23
2             CEO        Employee    56
3  Vice President             CEO   112
4        Employee             CEO    20

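I want to count the co-occurrences so that both directions are combined, i.e. the result only shows how many elements there are between each pair of titles, regardless of which one is "from" and which is "to".

Example output: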

0  CEO  Vice President   135
1  CEO        Employee    76
2  CEO             CEO    65

import pandas as pd

df = pd.DataFrame({
    'fromJobtitle': ['CEO', 'CEO', 'CEO', 'Vice President', 'Employee'],
    'toJobtitle': ['CEO', 'Vice President', 'Employee', 'CEO', 'CEO'],
    'size': [65, 23, 56, 112, 20]
})

# Sort each pair so that (A, B) and (B, A) map to the same key
df['combination'] = df.apply(
    lambda row: tuple(sorted([row['fromJobtitle'], row['toJobtitle']])),
    axis=1
)

Then:

# Sum only 'size'; a bare .sum() may also try to aggregate the two string columns
df = df.groupby('combination', as_index=False)['size'].sum()

Result:

    combination             size
0   (CEO, CEO)              65
1   (CEO, Employee)         76
2   (CEO, Vice President)   135

Finally:

df['from'] = df.apply(lambda row: row['combination'][0], axis=1)
df['to'] = df.apply(lambda row: row['combination'][1], axis=1)
df = df.drop('combination', axis=1)
df.head()

Result:

    size    from    to
0   65      CEO     CEO
1   76      CEO     Employee
2   135     CEO     Vice President
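
For what it's worth, the steps above can probably be collapsed into one pass that never stores tuples in a column. The following is only a sketch: the intermediate names `pairs`, `keys` and `result` are my own, the `from`/`to` column names follow the output above, and the descending sort on `size` is added just to match the ordering of the example output in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "fromJobtitle": ["CEO", "CEO", "CEO", "Vice President", "Employee"],
    "toJobtitle": ["CEO", "Vice President", "Employee", "CEO", "CEO"],
    "size": [65, 23, 56, 112, 20],
})

# Sort each (from, to) pair alphabetically so both directions share one key.
pairs = [sorted(p) for p in zip(df["fromJobtitle"], df["toJobtitle"])]
keys = pd.DataFrame(pairs, columns=["from", "to"], index=df.index)

result = (
    keys.assign(size=df["size"])                      # attach the counts to the sorted pairs
    .groupby(["from", "to"], as_index=False)["size"]  # group on the order-independent pair
    .sum()
    .sort_values("size", ascending=False, ignore_index=True)
)
print(result)
```

Grouping on two plain string columns instead of one tuple column means there is nothing to unpack afterwards, at the cost of building a small helper frame.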

Try:

df.groupby(lambda x: tuple(sorted(df.loc[x, ['fromJobtitle', 'toJobtitle']])))[['size']].sum()

The result is:

                       size
(CEO, CEO)               65
(CEO, Employee)          76
(CEO, Vice President)   135

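Here is a different solution:

First create a column that combines the two names in alphabetical order

import numpy as np

df['titles'] = np.where(
    df['fromJobtitle'] < df['toJobtitle'],
    df['fromJobtitle'] + "|" + df['toJobtitle'],
    df['toJobtitle'] + "|" + df['fromJobtitle']
)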

0               CEO|CEO
1    CEO|Vice President
2          CEO|Employee
3    CEO|Vice President
4          CEO|Employee
Name: titles, dtype: object

Then group by that combined name and sum the sizes

df_groups = df.groupby('titles', as_index=False)['size'].sum()
df_groups

titles              size
CEO|CEO             65
CEO|Employee        76
CEO|Vice President  135

Then split the combined name back into its separate parts

df_groups[['fromJobTitle', 'toJobTitle']] = df_groups['titles'].str.split('|', expand=True)
df_groups = df_groups.drop(columns='titles')
df_groups

size fromJobTitle   toJobTitle
65    CEO           CEO
76    CEO           Employee
135   CEO           Vice President
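
One caveat with joining on "|" is that it would break if a job title ever contained that character. A possible variation, sketched here under the assumption that both columns hold plain strings, is to sort the two columns row-wise with `np.sort` and group on the sorted pair directly. The names `ordered`, `pairs` and `totals` are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fromJobtitle": ["CEO", "CEO", "CEO", "Vice President", "Employee"],
    "toJobtitle": ["CEO", "Vice President", "Employee", "CEO", "CEO"],
    "size": [65, 23, 56, 112, 20],
})

# Sort the two titles within each row so (A, B) and (B, A) end up identical.
ordered = np.sort(
    df[["fromJobtitle", "toJobtitle"]].to_numpy(dtype=object), axis=1
)

pairs = pd.DataFrame({
    "from": ordered[:, 0],
    "to": ordered[:, 1],
    "size": df["size"].to_numpy(),
})

# Group on the sorted pair; no separator and no splitting step needed.
totals = pairs.groupby(["from", "to"], as_index=False)["size"].sum()
print(totals)
```

This keeps the two titles in separate columns throughout, so the final split step is no longer needed.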