Merging unequal dataframes on multiple columns with left_on and right_on
I have the following df:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['Sara', 'John', 'Christine'],
                    'email': ['sara@example.com', 'john@example.com', 'Christine@example.com']})
df1:
        name                  email
0       Sara       sara@example.com
1       John       john@example.com
2  Christine  Christine@example.com
and the following df2, which holds additional email information for the customers:
df2 = pd.DataFrame({'email_id':['sara@example.com', np.nan , 'flower@example8.com'],
'alternate email': ['sara@example.com', 'john.walker@example.com' , 'Christine33@example.com'],
'alternate email2': ['sara13@example.com', 'john@example.com', 'Christine@example.com']})
df2:
              email_id          alternate email       alternate email2
0     sara@example.com         sara@example.com     sara13@example.com
1                  NaN  john.walker@example.com       john@example.com
2  flower@example8.com  Christine33@example.com  Christine@example.com
Now I want to left-merge the two dataframes (keeping df1 on the left) and match df1's email against several columns of df2.
If I merge using left_on and right_on:
df1.merge(df2, left_on='email', right_on='email_id', how='left')
then only one customer is matched:
        name                  email          email_id   alternate email    alternate email2
0       Sara       sara@example.com  sara@example.com  sara@example.com  sara13@example.com
1       John       john@example.com               NaN               NaN                 NaN
2  Christine  Christine@example.com               NaN               NaN                 NaN
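Is there a way to merge one column on the left against multiple columns on the right? I could run several matches one by one, but that is not practical!
Edit:
Expected output: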
        name                  email             email_id          alternate email       alternate email2
0       Sara       sara@example.com     sara@example.com         sara@example.com     sara13@example.com
1       John       john@example.com                  NaN  john.walker@example.com       john@example.com
2  Christine  Christine@example.com  flower@example8.com  Christine33@example.com  Christine@example.com
That is, something like the code below:
df1.merge(df2, left_on='email', right_on=['email_id','alternate email','alternate email2'], how='left')
But this raises an error, because left_on and right_on contain a different number of keys:
len(right_on) must equal len(left_on)
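For context, merge pairs the keys in left_on and right_on positionally, so one key on the left cannot be spread over three on the right. A minimal sketch reproducing the error with the frames from the question (the variable name message is only for illustration):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['Sara', 'John', 'Christine'],
                    'email': ['sara@example.com', 'john@example.com', 'Christine@example.com']})
df2 = pd.DataFrame({'email_id': ['sara@example.com', np.nan, 'flower@example8.com'],
                    'alternate email': ['sara@example.com', 'john.walker@example.com', 'Christine33@example.com'],
                    'alternate email2': ['sara13@example.com', 'john@example.com', 'Christine@example.com']})

message = None
try:
    # One key on the left, three on the right: the key lists cannot be paired up.
    df1.merge(df2, left_on='email',
              right_on=['email_id', 'alternate email', 'alternate email2'],
              how='left')
except ValueError as err:
    message = str(err)
    print(message)
```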
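Try:
The idea is to merge df1 against each of the df2 columns in turn:
cols = ['email_id', 'alternate email', 'alternate email2']
# one left merge per candidate column, stacked, keeping only rows that matched somewhere
out = (pd.concat([df1.merge(df2[x], left_on='email', right_on=x, how='left') for x in cols])
         .dropna(subset=cols, how='all'))
# fill in the other df2 columns; fillna aligns on the index, so this assumes
# that row i of df1 corresponds to row i of df2
out[cols] = out[cols].fillna(df2[cols])
out = out.drop_duplicates()
Printing out now gives the desired output.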
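Note that the fillna step above fills by index label, so it depends on df1 and df2 listing the customers in the same order. A sketch of a variant that avoids this by merging against the whole of df2 for each candidate column, so every match brings its complete df2 row along (the name matched is just illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['Sara', 'John', 'Christine'],
                    'email': ['sara@example.com', 'john@example.com', 'Christine@example.com']})
df2 = pd.DataFrame({'email_id': ['sara@example.com', np.nan, 'flower@example8.com'],
                    'alternate email': ['sara@example.com', 'john.walker@example.com', 'Christine33@example.com'],
                    'alternate email2': ['sara13@example.com', 'john@example.com', 'Christine@example.com']})

cols = ['email_id', 'alternate email', 'alternate email2']

# Inner-merge df1 against all of df2 once per candidate column,
# then stack the matches and drop rows found through more than one column.
matched = pd.concat(
    [df1.merge(df2, left_on='email', right_on=c, how='inner') for c in cols],
    ignore_index=True,
).drop_duplicates()

# Left-merge back onto df1 so customers without any match are kept (as NaN).
out = df1.merge(matched, on=['name', 'email'], how='left')
print(out)
```

If one email in df1 matches several different rows of df2, this version returns one output row per match, which may or may not be what you want.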
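Another way:
# build a lookup from every distinct email in df2 to the df2 row label it came from
indexes = df2.unstack().reset_index(level=0, drop=True) \
             .rename('email').drop_duplicates().dropna() \
             .reset_index()
# attach the df2 row label to each customer, then pull in that df2 row
df1 = df1.merge(indexes, how='left')
df1 = df1.merge(df2, left_on='index', right_index=True).drop(columns='index')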
>>> df1
        name                  email             email_id          alternate email       alternate email2
0       Sara       sara@example.com     sara@example.com         sara@example.com     sara13@example.com
1       John       john@example.com                  NaN  john.walker@example.com       john@example.com
2  Christine  Christine@example.com  flower@example8.com  Christine33@example.com  Christine@example.com
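The same lookup idea can also be written with melt instead of unstack, which some readers may find easier to follow. This is only a sketch: the names lookup and result are illustrative, it assumes each email belongs to a single df2 row, and it passes how='left' to the second merge so that customers without a match would be kept.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['Sara', 'John', 'Christine'],
                    'email': ['sara@example.com', 'john@example.com', 'Christine@example.com']})
df2 = pd.DataFrame({'email_id': ['sara@example.com', np.nan, 'flower@example8.com'],
                    'alternate email': ['sara@example.com', 'john.walker@example.com', 'Christine33@example.com'],
                    'alternate email2': ['sara13@example.com', 'john@example.com', 'Christine@example.com']})

cols = ['email_id', 'alternate email', 'alternate email2']

# Long format: one row per (df2 row label, email) pair.
lookup = (
    df2[cols]
    .reset_index()                              # df2 row label goes into a column named 'index'
    .melt(id_vars='index', value_name='email')  # stack the three email columns
    .dropna(subset=['email'])
    .drop_duplicates(subset='email')            # assumption: one df2 row per email
    [['index', 'email']]
)

result = (
    df1.merge(lookup, on='email', how='left')                       # find the df2 row label
       .merge(df2, left_on='index', right_index=True, how='left')   # pull in that df2 row
       .drop(columns='index')
)
print(result)
```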