How do I merge two DataFrames on a column with duplicates and get output without duplicated rows?
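I have two DataFrames, df1 and df2, as shown below:
df1:
| | company | occupation |
|---|---|---|
| 0 | A | Administrator |
| 1 | B | Engineer |
| 2 | C | Engineer |
| 3 | D | Account |
| 4 | E | Administrator |
| 5 | F | Engineer |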
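df2:
| | occupation | description |
|---|---|---|
| 0 | Account | balance |
| 1 | Engineer | database |
| 2 | Administrator | chores |
| 3 | Administrator | calling |
| 4 | Engineer | frontend |
| 5 | Engineer | backendend |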
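What I want:
| | company | occupation | description |
|---|---|---|---|
| 0 | A | Administrator | chores |
| 1 | B | Engineer | database |
| 2 | C | Engineer | frontend |
| 3 | D | Account | balance |
| 4 | E | Administrator | calling |
| 5 | F | Engineer | backendend |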
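I tried pd.merge(df1, df2, how="inner"), but I always get duplicated rows:
| | company | occupation | description |
|---|---|---|---|
| 0 | A | Administrator | chores |
| 1 | A | Administrator | calling |
| 2 | E | Administrator | chores |
| 3 | E | Administrator | calling |
| 4 | B | Engineer | database |
| 5 | B | Engineer | frontend |
| 6 | B | Engineer | backendend |
| 7 | C | Engineer | database |
| 8 | C | Engineer | frontend |
| 9 | C | Engineer | backendend |
| 10 | F | Engineer | database |
| 11 | F | Engineer | frontend |
| 12 | F | Engineer | backendend |
| 13 | D | Account | balance |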
Code:
import pandas as pd
from IPython.display import display  # only needed outside a Jupyter notebook

df1 = pd.DataFrame({"company": ["A", "B", "C", "D", "E", "F"], "occupation": ["Administrator", "Engineer", "Engineer", "Account", "Administrator", "Engineer"]})
df2 = pd.DataFrame({"occupation": ["Account", "Engineer", "Administrator", "Administrator", "Engineer", "Engineer"], "description": ["balance", "database", "chores", "calling", "frontend", "backendend"]})
# df3 is the expected result, written out by hand to match the "What I want" table
df3 = pd.DataFrame({"company": ["A", "B", "C", "D", "E", "F"], "occupation": ["Administrator", "Engineer", "Engineer", "Account", "Administrator", "Engineer"], "description": ["chores", "database", "frontend", "balance", "calling", "backendend"]})
# df4 is the plain inner merge on the shared "occupation" column (the duplicated rows)
df4 = pd.merge(df1, df2, how="inner")
display(df1)
display(df2)
display(df3)
display(df4)
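The duplicated rows come from the many-to-many relationship on occupation: for each occupation, every matching row of df1 is paired with every matching row of df2. The following is a minimal sketch, using the same data as the question, that predicts the size of the merged result from the per-occupation counts. The names left_counts, right_counts, rows_per_key, and expected_rows are introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "occupation": ["Administrator", "Engineer", "Engineer",
                   "Account", "Administrator", "Engineer"],
})
df2 = pd.DataFrame({
    "occupation": ["Account", "Engineer", "Administrator",
                   "Administrator", "Engineer", "Engineer"],
    "description": ["balance", "database", "chores",
                    "calling", "frontend", "backendend"],
})

# How many times each occupation appears on each side.
left_counts = df1["occupation"].value_counts()
right_counts = df2["occupation"].value_counts()

# A many-to-many merge yields (left count) x (right count) rows per key.
rows_per_key = (left_counts * right_counts).dropna().astype(int)
expected_rows = int(rows_per_key.sum())

merged = pd.merge(df1, df2, how="inner")
print(rows_per_key.to_dict())
print(expected_rows, len(merged))
```

Here Administrator contributes 2 x 2 rows, Engineer 3 x 3, and Account 1 x 1, which accounts for the 14 rows shown in the question.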
Let's try using groupby cumcount to create a key column that tracks the position of each row within its occupation group, then merge on occupation and key:
df1['key'] = df1.groupby('occupation').cumcount()
df2['key'] = df2.groupby('occupation').cumcount()
df4 = df1.merge(df2, on=['occupation', 'key']).drop('key', axis=1)
df4:
  company     occupation description
0       A  Administrator      chores
1       B       Engineer    database
2       C       Engineer    frontend
3       D        Account     balance
4       E  Administrator     calling
5       F       Engineer  backendend
df4 without dropping key:
  company     occupation  key description
0       A  Administrator    0      chores
1       B       Engineer    0    database
2       C       Engineer    1    frontend
3       D        Account    0     balance
4       E  Administrator    1     calling
5       F       Engineer    2  backendend
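The key column works because the pair (occupation, key) is unique within each DataFrame, so the merge becomes one-to-one. As a rough sketch of how that could be checked, the example below uses assign so that df1 and df2 are left unmodified, and passes validate="one_to_one" to merge, which makes pandas raise an error if either side has duplicate key pairs. The names left and right are illustrative, not from the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "occupation": ["Administrator", "Engineer", "Engineer",
                   "Account", "Administrator", "Engineer"],
})
df2 = pd.DataFrame({
    "occupation": ["Account", "Engineer", "Administrator",
                   "Administrator", "Engineer", "Engineer"],
    "description": ["balance", "database", "chores",
                    "calling", "frontend", "backendend"],
})

# Number each row within its occupation group: 0, 1, 2, ...
left = df1.assign(key=df1.groupby("occupation").cumcount())
right = df2.assign(key=df2.groupby("occupation").cumcount())

# (occupation, key) should now identify a single row on each side.
assert not left.duplicated(["occupation", "key"]).any()
assert not right.duplicated(["occupation", "key"]).any()

# validate="one_to_one" asks pandas to raise if the keys are not unique.
merged = left.merge(
    right, on=["occupation", "key"], validate="one_to_one"
).drop(columns="key")
print(merged)
```

Note that this pairs rows purely by order of appearance, so the result depends on how df1 and df2 are sorted before the merge.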
You can also merge on the Series directly, without modifying df1 or df2:
df4 = df1.merge(
    df2,
    left_on=['occupation', df1.groupby('occupation').cumcount()],
    right_on=['occupation', df2.groupby('occupation').cumcount()]
).drop('key_1', axis=1)
df4:
  company     occupation description
0       A  Administrator      chores
1       B       Engineer    database
2       C       Engineer    frontend
3       D        Account     balance
4       E  Administrator     calling
5       F       Engineer  backendend
You can synthesize the required part of the merge condition: the position of each occupation within its DataFrame.
df1 = pd.DataFrame({'company': ['A', 'B', 'C', 'D', 'E', 'F'],
                    'occupation': ['Administrator','Engineer','Engineer','Account','Administrator','Engineer']})
df2 = pd.DataFrame({'occupation': ['Account','Engineer','Administrator','Administrator','Engineer','Engineer'],
                    'description': ['balance','database','chores','calling','frontend','backendend']})
df1.assign(oid=df1.groupby("occupation", as_index=False).cumcount()).merge(
    df2.assign(oid=df2.groupby("occupation", as_index=False).cumcount()),
    on=["occupation", "oid"],
)
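| | company | occupation | oid | description |
|---|---|---|---|---|
| 0 | A | Administrator | 0 | chores |
| 1 | B | Engineer | 0 | database |
| 2 | C | Engineer | 1 | frontend |
| 3 | D | Account | 0 | balance |
| 4 | E | Administrator | 1 | calling |
| 5 | F | Engineer | 2 | backendend |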
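If this pattern is needed in more than one place, it could be wrapped in a small helper. The sketch below is one possible way to do that. The function name merge_by_occurrence and the temporary column name _occurrence are hypothetical (neither is part of pandas), and the helper assumes that on is a single column name and that no input already has a column called _occurrence.

```python
import pandas as pd


def merge_by_occurrence(left, right, on, how="inner"):
    """Pair the n-th occurrence of each `on` value in `left` with the
    n-th occurrence in `right` (hypothetical helper, not a pandas API)."""
    tmp = "_occurrence"  # temporary column; assumed unused in both inputs
    left_keyed = left.assign(**{tmp: left.groupby(on).cumcount()})
    right_keyed = right.assign(**{tmp: right.groupby(on).cumcount()})
    merged = left_keyed.merge(right_keyed, on=[on, tmp], how=how)
    return merged.drop(columns=tmp)


df1 = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "occupation": ["Administrator", "Engineer", "Engineer",
                   "Account", "Administrator", "Engineer"],
})
df2 = pd.DataFrame({
    "occupation": ["Account", "Engineer", "Administrator",
                   "Administrator", "Engineer", "Engineer"],
    "description": ["balance", "database", "chores",
                    "calling", "frontend", "backendend"],
})

result = merge_by_occurrence(df1, df2, on="occupation")
print(result)
```

With the default how="inner", a row whose occurrence number has no counterpart on the other side is dropped. Passing how="left" would keep such rows from the left DataFrame with a missing description instead.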