How do I merge two DataFrames on a column with duplicate values and get output without duplicated rows?

I have two DataFrames, df1 and df2, as shown below:

df1:

  company     occupation
0       A  Administrator
1       B       Engineer
2       C       Engineer
3       D        Account
4       E  Administrator
5       F       Engineer

df2:

      occupation description
0        Account     balance
1       Engineer    database
2  Administrator      chores
3  Administrator     calling
4       Engineer    frontend
5       Engineer  backendend

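What I want:

  company     occupation description
0       A  Administrator      chores
1       B       Engineer    database
2       C       Engineer    frontend
3       D        Account     balance
4       E  Administrator     calling
5       F       Engineer  backendend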

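I tried pd.merge(df1, df2, how="inner"), but I always get duplicated rows:

   company     occupation description
0        A  Administrator      chores
1        A  Administrator     calling
2        E  Administrator      chores
3        E  Administrator     calling
4        B       Engineer    database
5        B       Engineer    frontend
6        B       Engineer  backendend
7        C       Engineer    database
8        C       Engineer    frontend
9        C       Engineer  backendend
10       F       Engineer    database
11       F       Engineer    frontend
12       F       Engineer  backendend
13       D        Account     balance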

Code:

import pandas as pd
df1 = pd.DataFrame({"company":["A","B","C","D","E","F"],"occupation":["Administrator","Engineer","Engineer","Account","Administrator","Engineer"]})
df2 = pd.DataFrame({"occupation":["Account","Engineer","Administrator","Administrator","Engineer","Engineer"],"description":["balance","database","chores","calling","frontend","backendend"]})
# df3 is the expected result; descriptions are ordered to match the desired table above
df3 = pd.DataFrame({"company":["A","B","C","D","E","F"],"occupation":["Administrator","Engineer","Engineer","Account","Administrator","Engineer"],"description":["chores","database","frontend","balance","calling","backendend"]})
df4 = pd.merge(df1,df2,how="inner")
display(df1)
display(df2)
display(df3)
display(df4)

Let's try creating a key column with groupby cumcount to track each row's position within its occupation group, then merge on both occupation and key:

df1['key'] = df1.groupby('occupation').cumcount()
df2['key'] = df2.groupby('occupation').cumcount()
df4 = df1.merge(df2, on=['occupation', 'key']).drop('key', axis=1)

df4:

  company     occupation description
0       A  Administrator      chores
1       B       Engineer    database
2       C       Engineer    frontend
3       D        Account     balance
4       E  Administrator     calling
5       F       Engineer  backendend

df4 without dropping key:

  company     occupation  key description
0       A  Administrator    0      chores
1       B       Engineer    0    database
2       C       Engineer    1    frontend
3       D        Account    0     balance
4       E  Administrator    1     calling
5       F       Engineer    2  backendend
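To see why the extra key helps, here is a minimal sketch that compares row counts. A plain merge on occupation is many-to-many, so each occupation contributes (rows in df1) x (rows in df2) pairs; adding the cumcount key makes every (occupation, key) pair unique on both sides. The names plain, left, right and paired are illustrative only, and validate="one_to_one" is an optional safety check, not something the approach requires.

```python
import pandas as pd

df1 = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "occupation": ["Administrator", "Engineer", "Engineer",
                   "Account", "Administrator", "Engineer"],
})
df2 = pd.DataFrame({
    "occupation": ["Account", "Engineer", "Administrator",
                   "Administrator", "Engineer", "Engineer"],
    "description": ["balance", "database", "chores",
                    "calling", "frontend", "backendend"],
})

# Plain merge: every df1 row pairs with every df2 row of the same occupation.
plain = pd.merge(df1, df2, how="inner")

# Pairs per occupation: Administrator 2*2 + Engineer 3*3 + Account 1*1 = 14.
pairs_per_occupation = (
    df1["occupation"].value_counts() * df2["occupation"].value_counts()
)
n_plain = len(plain)
n_expected = int(pairs_per_occupation.sum())

# Positional key: the n-th Engineer in df1 pairs only with the n-th in df2.
left = df1.assign(key=df1.groupby("occupation").cumcount())
right = df2.assign(key=df2.groupby("occupation").cumcount())

# validate raises MergeError if (occupation, key) is not unique on either side.
paired = left.merge(right, on=["occupation", "key"], validate="one_to_one")

print(n_plain, n_expected, len(paired))  # 14 14 6
```

Using assign here leaves df1 and df2 themselves unchanged.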

Alternatively, merge on the Series directly, without modifying df1 and df2:

df4 = df1.merge(
    df2,
    left_on=['occupation', df1.groupby('occupation').cumcount()],
    right_on=['occupation', df2.groupby('occupation').cumcount()]
).drop('key_1', axis=1)

df4:

  company     occupation description
0       A  Administrator      chores
1       B       Engineer    database
2       C       Engineer    frontend
3       D        Account     balance
4       E  Administrator     calling
5       F       Engineer  backendend

You can synthesize the missing part of the merge condition: the position of each occupation within its DataFrame.

df1 = pd.DataFrame({'company': ['A', 'B', 'C', 'D', 'E', 'F'],
 'occupation': ['Administrator','Engineer','Engineer','Account','Administrator','Engineer']})

df2 = pd.DataFrame({'occupation': ['Account','Engineer','Administrator','Administrator','Engineer','Engineer'],
 'description': ['balance','database','chores','calling','frontend','backendend']})

df1.assign(oid=df1.groupby("occupation", as_index=False).cumcount()).merge(
    df2.assign(oid=df2.groupby("occupation", as_index=False).cumcount()),
    on=["occupation", "oid"],
)

  company     occupation  oid description
0       A  Administrator    0      chores
1       B       Engineer    0    database
2       C       Engineer    1    frontend
3       D        Account    0     balance
4       E  Administrator    1     calling
5       F       Engineer    2  backendend
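If this comes up more than once, the same idea can be wrapped in a small helper. This is only a sketch: merge_by_occurrence and the temporary column _occ are hypothetical names (not part of pandas), and it assumes on is a single column name and that neither frame already has an _occ column. The sample frames companies and descriptions are made up to show what happens when the counts per occupation differ; with how="left", unmatched rows are kept with NaN.

```python
import pandas as pd

def merge_by_occurrence(left, right, on, how="inner"):
    """Pair the n-th row of each `on` group in `left` with the n-th in `right`."""
    # "_occ" is a hypothetical temporary column holding the within-group position.
    left_keyed = left.assign(_occ=left.groupby(on).cumcount())
    right_keyed = right.assign(_occ=right.groupby(on).cumcount())
    merged = left_keyed.merge(right_keyed, on=[on, "_occ"], how=how)
    return merged.drop(columns="_occ")

# Made-up data: three Engineer companies but only two Engineer descriptions.
companies = pd.DataFrame({
    "company": ["A", "B", "G"],
    "occupation": ["Engineer", "Engineer", "Engineer"],
})
descriptions = pd.DataFrame({
    "occupation": ["Engineer", "Engineer"],
    "description": ["database", "frontend"],
})

# Inner join drops the third company; left join keeps it with a missing description.
inner_result = merge_by_occurrence(companies, descriptions, on="occupation")
left_result = merge_by_occurrence(companies, descriptions, on="occupation", how="left")
print(left_result)
```

Whether an unmatched row should be dropped or kept depends on the data, which is why how is left as a parameter here.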