pandas join tables on two columns without ordering of values
I want to achieve what is described in a related question, but using only standard pandas.
I have two dataframes:
First:
first_employee target_employee relationship
0 Andy Claude 0
1 Andy Frida 20
2 Andy Georgia -10
3 Andy Joan 30
4 Andy Lee -10
5 Andy Pablo -10
6 Andy Vincent 20
7 Claude Frida 0
8 Claude Georgia 90
9 Claude Joan 0
10 Claude Lee 0
11 Claude Pablo 10
12 Claude Vincent 0
13 Frida Georgia 0
14 Frida Joan 0
15 Frida Lee 0
16 Frida Pablo 50
17 Frida Vincent 60
18 Georgia Joan 0
19 Georgia Lee 10
20 Georgia Pablo 0
21 Georgia Vincent 0
22 Joan Lee 70
23 Joan Pablo 0
24 Joan Vincent 10
25 Lee Pablo 0
26 Lee Vincent 0
27 Pablo Vincent -20
Second:
first_employee target_employee book_count
0 Vincent Frida 2
1 Vincent Pablo 1
2 Andy Claude 1
3 Andy Joan 1
4 Andy Pablo 1
5 Andy Lee 1
6 Andy Frida 1
7 Andy Georgia 1
8 Claude Georgia 3
9 Joan Lee 3
10 Pablo Frida 2
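I want to join the two dataframes so that the final dataframe matches the first one but also has a book_count column with the corresponding values (NaN where no value is available).
I wrote something like this:
joined_df = first_df.merge(second_df, on=['first_employee', 'target_employee'], how='outer')
and I get: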
first_employee target_employee relationship book_count
0 Andy Claude 0.0 1.0
1 Andy Frida 20.0 1.0
2 Andy Georgia -10.0 1.0
3 Andy Joan 30.0 1.0
4 Andy Lee -10.0 1.0
5 Andy Pablo -10.0 1.0
6 Andy Vincent 20.0 NaN
7 Claude Frida 0.0 NaN
8 Claude Georgia 90.0 3.0
9 Claude Joan 0.0 NaN
10 Claude Lee 0.0 NaN
11 Claude Pablo 10.0 NaN
12 Claude Vincent 0.0 NaN
13 Frida Georgia 0.0 NaN
14 Frida Joan 0.0 NaN
15 Frida Lee 0.0 NaN
16 Frida Pablo 50.0 NaN
17 Frida Vincent 60.0 NaN
18 Georgia Joan 0.0 NaN
19 Georgia Lee 10.0 NaN
20 Georgia Pablo 0.0 NaN
21 Georgia Vincent 0.0 NaN
22 Joan Lee 70.0 3.0
23 Joan Pablo 0.0 NaN
24 Joan Vincent 10.0 NaN
25 Lee Pablo 0.0 NaN
26 Lee Vincent 0.0 NaN
27 Pablo Vincent -20.0 NaN
28 Vincent Frida NaN 2.0
29 Vincent Pablo NaN 1.0
30 Pablo Frida NaN 2.0
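That is fairly close to what I want. However, the order of the values in first_employee and target_employee is irrelevant: if the first dataframe has (Frida, Vincent) and the second has (Vincent, Frida), the two should be merged together (what matters is the values, not which column they sit in).
In the resulting dataframe I get three extra rows: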
28 Vincent Frida NaN 2.0
29 Vincent Pablo NaN 1.0
30 Pablo Frida NaN 2.0
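This happens because the merge treats the values as ordered by column when joining. These 3 extra rows should instead be merged into the already existing pairs (Frida, Vincent), (Pablo, Vincent) and (Frida, Pablo).
Is there a way to do this using only standard pandas functions? (The question I referenced at the beginning uses sqldf.)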
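I believe this is what you are looking for. np.sort reorders the first two columns within each row so that they are in alphabetical order, which lets the merge match pairs regardless of the order they were originally written in.
import numpy as np
import pandas as pd

cols = ['first_employee', 'target_employee']
# Sort the two names within each row (axis=1); this overwrites the columns in place
df[cols] = np.sort(df[cols].to_numpy(), axis=1)
df2[cols] = np.sort(df2[cols].to_numpy(), axis=1)
# A left merge keeps every row of df and adds book_count where the pair matches
ndf = pd.merge(df, df2, on=cols, how='left')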
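As a self-contained illustration of the np.sort approach above, here is a minimal sketch on a small subset of the question's data. The names first_df, second_df and the helper normalize_pairs are assumptions made for this example, not part of the original answer; the helper works on a copy so the input frames are left unchanged.

```python
import numpy as np
import pandas as pd

cols = ["first_employee", "target_employee"]

# A few rows taken from the question's data, kept small for illustration.
first_df = pd.DataFrame({
    "first_employee": ["Frida", "Frida", "Pablo", "Andy"],
    "target_employee": ["Pablo", "Vincent", "Vincent", "Vincent"],
    "relationship": [50, 60, -20, 20],
})
second_df = pd.DataFrame({
    "first_employee": ["Vincent", "Vincent", "Pablo"],
    "target_employee": ["Frida", "Pablo", "Frida"],
    "book_count": [2, 1, 2],
})


def normalize_pairs(frame):
    """Hypothetical helper: return a copy whose two name columns are
    sorted alphabetically within each row."""
    out = frame.copy()
    out[cols] = np.sort(out[cols].to_numpy(), axis=1)
    return out


# Left merge: every row of first_df is kept, book_count is NaN where no pair matches.
joined = normalize_pairs(first_df).merge(normalize_pairs(second_df), on=cols, how="left")
print(joined)
```

With how='left' the result has exactly the rows of the first frame, which is what the question asks for; how='outer' would additionally keep pairs that appear only in the second frame.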
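Create a key as the sorted tuple of the first and target employees, then merge on it:
# Order-independent key: the two names as a sorted tuple
create_key = lambda x: tuple(sorted([x['first_employee'], x['target_employee']]))
# The final .loc drops the helper _key column and df2's duplicated name columns
out = pd.merge(df1.assign(_key=df1.apply(create_key, axis=1)),
               df2.assign(_key=df2.apply(create_key, axis=1)),
               on='_key', suffixes=('', '_key'), how='outer') \
        .loc[:, lambda x: ~x.columns.str.endswith('_key')]
print(out)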
# Output:
first_employee target_employee relationship book_count
0 Andy Claude 0 1.0
1 Andy Frida 20 1.0
2 Andy Georgia -10 1.0
3 Andy Joan 30 1.0
4 Andy Lee -10 1.0
5 Andy Pablo -10 1.0
6 Andy Vincent 20 NaN
7 Claude Frida 0 NaN
8 Claude Georgia 90 3.0
9 Claude Joan 0 NaN
10 Claude Lee 0 NaN
11 Claude Pablo 10 NaN
12 Claude Vincent 0 NaN
13 Frida Georgia 0 NaN
14 Frida Joan 0 NaN
15 Frida Lee 0 NaN
16 Frida Pablo 50 2.0
17 Frida Vincent 60 2.0
18 Georgia Joan 0 NaN
19 Georgia Lee 10 NaN
20 Georgia Pablo 0 NaN
21 Georgia Vincent 0 NaN
22 Joan Lee 70 3.0
23 Joan Pablo 0 NaN
24 Joan Vincent 10 NaN
25 Lee Pablo 0 NaN
26 Lee Vincent 0 NaN
27 Pablo Vincent -20 1.0
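One caveat that applies to both answers: once order is ignored, the second frame could contain the same pair twice, for example (Vincent, Frida) and (Frida, Vincent), and the merge would then duplicate rows from the first frame. The question's data has no such case, so the following is only a sketch on made-up data; first_df, second_df and normalize_pairs are hypothetical names. It uses DataFrame.duplicated and the validate argument of merge to detect the situation.

```python
import numpy as np
import pandas as pd

cols = ["first_employee", "target_employee"]

first_df = pd.DataFrame({
    "first_employee": ["Frida"],
    "target_employee": ["Vincent"],
    "relationship": [60],
})
# Made-up second frame that lists the same pair in both orders.
second_df = pd.DataFrame({
    "first_employee": ["Vincent", "Frida"],
    "target_employee": ["Frida", "Vincent"],
    "book_count": [2, 5],
})


def normalize_pairs(frame):
    """Hypothetical helper: sort the two name columns within each row."""
    out = frame.copy()
    out[cols] = np.sort(out[cols].to_numpy(), axis=1)
    return out


left = normalize_pairs(first_df)
right = normalize_pairs(second_df)

# True when an unordered pair occurs more than once in the second frame.
has_duplicate_pairs = bool(right.duplicated(subset=cols).any())

# validate="many_to_one" makes merge raise if the right-hand keys are not unique.
try:
    left.merge(right, on=cols, how="left", validate="many_to_one")
    merge_error = None
except pd.errors.MergeError as exc:
    merge_error = exc

print(has_duplicate_pairs, merge_error)
```

If duplicates are found, how to resolve them (summing book_count, keeping the first row, and so on) depends on what the data means and is not specified in the question.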