Merging unequal dataframes on multiple columns with left_on & right_on

I have the following df:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['Sara', 'John', 'Christine'],
                    'email': ['sara@example.com', 'john@example.com', 'Christine@example.com']})

df1:

    name         email
0   Sara         sara@example.com
1   John         john@example.com
2   Christine    Christine@example.com

and the following df2, which holds additional email addresses for the customers:

df2 = pd.DataFrame({'email_id': ['sara@example.com', np.nan, 'flower@example8.com'],
                    'alternate email': ['sara@example.com', 'john.walker@example.com', 'Christine33@example.com'],
                    'alternate email2': ['sara13@example.com', 'john@example.com', 'Christine@example.com']})

df2:

    email_id                alternate email             alternate email2
0   sara@example.com        sara@example.com            sara13@example.com
1   NaN                     john.walker@example.com     john@example.com
2   flower@example8.com     Christine33@example.com     Christine@example.com

Now I want to left-merge the two dataframes (keeping every row of df1) and match on multiple columns of df2.

If I merge using left_on and right_on:

df1.merge(df2, left_on='email', right_on='email_id', how='left')

then only one customer is matched:

    name         email                   email_id          alternate email   alternate email2
0   Sara         sara@example.com        sara@example.com  sara@example.com  sara13@example.com
1   John         john@example.com        NaN               NaN               NaN
2   Christine    Christine@example.com   NaN               NaN               NaN

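Is there a way to merge one column on the left against multiple columns on the right? I could do the matches one at a time, but that is not practical!

Edit:

Expected output: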

    name            email                        email_id            alternate email            alternate email2
0   Sara            sara@example.com            sara@example.com     sara@example.com           sara13@example.com
1   John            john@example.com            NaN                  john.walker@example.com    john@example.com
2   Christine       Christine@example.com       flower@example8.com  Christine33@example.com    Christine@example.com

as if I could use the code below:

df1.merge(df2, left_on='email', right_on=['email_id','alternate email','alternate email2'], how='left')

But this raises an error, because left_on and right_on list a different number of columns:

len(right_on) must equal len(left_on)
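
The message refers to the number of key columns, not the number of rows. merge pairs the left_on and right_on keys position by position and requires every pair to match at once, so it cannot express "match any of these columns". A minimal sketch that reproduces the error with the frames from the question (the variable name error_message is only for illustration):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['Sara', 'John', 'Christine'],
                    'email': ['sara@example.com', 'john@example.com', 'Christine@example.com']})
df2 = pd.DataFrame({'email_id': ['sara@example.com', np.nan, 'flower@example8.com'],
                    'alternate email': ['sara@example.com', 'john.walker@example.com', 'Christine33@example.com'],
                    'alternate email2': ['sara13@example.com', 'john@example.com', 'Christine@example.com']})

try:
    # One key on the left, three on the right: the key lists cannot be paired up.
    df1.merge(df2, left_on='email',
              right_on=['email_id', 'alternate email', 'alternate email2'],
              how='left')
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```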

Try:

The idea is to merge df1 against each of the df2 columns in turn:

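cols = ['email_id', 'alternate email', 'alternate email2']
# Left-merge df1 against each df2 column in turn and stack the results.
out = (pd.concat([df1.merge(df2[x], left_on='email', right_on=x, how='left') for x in cols])
         .dropna(subset=cols, how='all'))  # keep only rows that matched some column
# Fill the other df2 columns by aligning on the row index.
out[cols] = out[cols].fillna(df2[cols])
out = out.drop_duplicates()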

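Now if you print out, you get your desired output. Note that the fillna step aligns on the row index, so it relies on df1 and df2 listing the customers in the same order.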
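
As a sketch of a variant that avoids the index-aligned fillna step (and so does not depend on df1 and df2 sharing row order), you could merge against the whole of df2 once per candidate column. This is not the code above; the helper name merge_on_any and its parameters are made up for illustration, and it assumes the two frames have no column names in common.

```python
import numpy as np
import pandas as pd

def merge_on_any(left, right, left_key, right_keys):
    """Left-merge `left` with `right`, matching `left_key` against any of `right_keys`.

    Hypothetical helper; assumes `left` and `right` share no column names.
    """
    # Inner-merge against the full right frame once per candidate column.
    pieces = [left.merge(right, left_on=left_key, right_on=key, how='inner')
              for key in right_keys]
    # A row that matched through several columns appears several times.
    matched = pd.concat(pieces, ignore_index=True).drop_duplicates()
    # Merge back onto `left` so unmatched rows survive with NaN in right's columns.
    return left.merge(matched, on=list(left.columns), how='left')

df1 = pd.DataFrame({'name': ['Sara', 'John', 'Christine'],
                    'email': ['sara@example.com', 'john@example.com', 'Christine@example.com']})
df2 = pd.DataFrame({'email_id': ['sara@example.com', np.nan, 'flower@example8.com'],
                    'alternate email': ['sara@example.com', 'john.walker@example.com', 'Christine33@example.com'],
                    'alternate email2': ['sara13@example.com', 'john@example.com', 'Christine@example.com']})

cols = ['email_id', 'alternate email', 'alternate email2']
out = merge_on_any(df1, df2, 'email', cols)
print(out)
```

If the same address appears in more than one row of df2, this variant returns one output row per matching df2 row.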

Another way:

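# Build a lookup from every email address in df2 to the df2 row it came from.
indexes = df2.unstack().reset_index(level=0, drop=True) \
             .rename('email').drop_duplicates().dropna() \
             .reset_index()

# Attach the matching df2 row label to each customer, then pull in that row.
df1 = df1.merge(indexes, how='left')
# how='left' keeps customers that have no match in df2.
df1 = df1.merge(df2, left_on='index', right_index=True, how='left').drop(columns='index')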
>>> df1
        name                  email             email_id          alternate email       alternate email2
0       Sara       sara@example.com     sara@example.com         sara@example.com     sara13@example.com
1       John       john@example.com                  NaN  john.walker@example.com       john@example.com
2  Christine  Christine@example.com  flower@example8.com  Christine33@example.com  Christine@example.com
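
To see how the lookup idea behaves when a customer has no match at all, here is a small sketch under a few assumptions: the extra customer "Omar" is made up for illustration, the lookup is joined to df2 first so that the join keys never contain NaN, and the final merge onto df1 uses how='left' so unmatched customers are kept.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'name': ['Sara', 'John', 'Christine', 'Omar'],  # 'Omar' is a made-up unmatched row
    'email': ['sara@example.com', 'john@example.com',
              'Christine@example.com', 'omar@example.com'],
})
df2 = pd.DataFrame({'email_id': ['sara@example.com', np.nan, 'flower@example8.com'],
                    'alternate email': ['sara@example.com', 'john.walker@example.com', 'Christine33@example.com'],
                    'alternate email2': ['sara13@example.com', 'john@example.com', 'Christine@example.com']})

# Map every distinct email address in df2 to the df2 row label it came from.
lookup = (
    df2.unstack()                        # Series indexed by (column, row label)
       .reset_index(level=0, drop=True)  # keep only the row label
       .rename('email')
       .dropna()
       .drop_duplicates()
       .reset_index()                    # columns: 'index', 'email'
)

# Attach the full df2 row to each address, then left-merge onto the customers.
lookup = lookup.merge(df2, left_on='index', right_index=True).drop(columns='index')
out = df1.merge(lookup, on='email', how='left')
print(out)
```

Because drop_duplicates keeps the first occurrence, an address that appears in several df2 rows is mapped to only one of them.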