pandas true outer join?

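How can I get a true outer join in pandas? By that I mean one that actually gives me the whole output, instead of combining the columns being merged on. Combining them seems a bit silly to me, because it makes it hard to determine which kind of operation to perform on each row. I do this all the time to detect whether I should insert, update, or delete data, but I always have to create extra copies of the merge columns, which is overhead (sometimes massive overhead) for some data sets.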

Example:

import pandas as pd

keys = ["A","B"]

df1 = pd.DataFrame({"A":[1,2,3],"B":["one","two","three"],"C":["testThis","testThat", "testThis"],"D":[None,hash("B"),hash("C")]})
df2 = pd.DataFrame({"A":[2,3,4],"B":["two","three","four"],"C":["testThis","testThat", "testThis"], "D":[hash("G"),hash("C"),hash("D")]})

fullJoinDf = df1.merge(df2, how="outer", left_on=keys, right_on=keys, suffixes=["","_r"])
display(
    fullJoinDf,
)

    A   B       C           D               C_r          D_r
0   1   one     testThis    NaN             NaN          NaN
1   2   two     testThat    -3.656526e+18   testThis    -9.136326e+18
2   3   three   testThis    -8.571400e+18   testThat    -8.571400e+18
3   4   four    NaN         NaN             testThis    -4.190116e+17

Notice how it magically combines A and B into a single set of columns. What I want is what I get from SQL outer joins and the like, for example:

    A    B      C           D               A_r  B_r     C_r        D_r
0   1    one    testThis    NaN             NaN  NaN     NaN        NaN     
1   2    two    testThat    -3.656526e+18   2    two     testThis   -9.136326e+18
2   3    three  testThis    -8.571400e+18   3    three   testThat   -8.571400e+18
3   NaN  NaN    NaN         NaN             4    four    testThis   -4.190116e+17

Edit for @Felipe Whitaker

Using concat:

df3 = df1.copy().set_index(keys)
df4 = df2.copy().set_index(keys)
t = pd.concat([df3,df4], axis=1)
t.reset_index()

    A   B       C           D               C           D
0   1   one     testThis    NaN             NaN         NaN
1   2   two     testThat    -3.656526e+18   testThis    -9.136326e+18
2   3   three   testThis    -8.571400e+18   testThat    -8.571400e+18
3   4   four    NaN         NaN             testThis    -4.190116e+17

Edited example* Given the answers, I'm posting more tests so that anyone who stumbles on this can see more of the "gotcha" variations I found while doing this.

import pandas as pd

keys = ["A","B"]

df1 = pd.DataFrame({"A":[1,2,3],"B":["one","two","three"],"C":["testThis","testThat", "testThis"],"D":[None,hash("B"),hash("C")]})
df2 = pd.DataFrame({"A":[2,3,4],"B":["two","three","four"],"C":["testThis","testThat", "testThis"], "D":[hash("G"),hash("C"),hash("D")]})

df3 = df1.copy()
df4 = df2.copy()
df3.index = df3[keys]
df4.index = df4[keys]

df5 = df1.copy().set_index(keys)
df6 = df2.copy().set_index(keys)


fullJoinDf = df5.merge(df6, how="outer", left_on=keys, right_on=keys, suffixes=["","_r"])
fullJoinDf_2 = df3.merge(df4, how="outer", left_index=True, right_index=True, suffixes=["","_r"])
t = pd.concat([df1,df2], axis=1, keys=["A","B"])
display(
    df3.index,
    df5.index,
    fullJoinDf,
    fullJoinDf_2,
    t,
)

Index([(1, 'one'), (2, 'two'), (3, 'three')], dtype='object')
MultiIndex([(1,   'one'),
            (2,   'two'),
            (3, 'three')],
           names=['A', 'B'])

    A   B       C           D               C_r         D_r
0   1   one     testThis    NaN             NaN         NaN
1   2   two     testThat    -3.656526e+18   testThis    -9.136326e+18
2   3   three   testThis    -8.571400e+18   testThat    -8.571400e+18
3   4   four    NaN         NaN             testThis    -4.190116e+17

            A    B      C           D               A_r  B_r    C_r        D_r
(1, one)    1.0  one    testThis    NaN             NaN  NaN    NaN        NaN
(2, two)    2.0  two    testThat    -3.656526e+18   2.0  two    testThis    -9.136326e+18
(3, three)  3.0  three  testThis    -8.571400e+18   3.0  three  testThat    -8.571400e+18
(4, four)   NaN  NaN    NaN         NaN             4.0  four   testThis    -4.190116e+17

    A   B       C           D               A   B       C           D
0   1   one     testThis    NaN             2   two     testThis    -9136325526401183790
1   2   two     testThat    -3.656526e+18   3   three   testThat    -8571400026927442160
2   3   three   testThis    -8.571400e+18   4   four    testThis    -419011572131270498

If you don't care about the original index at all:

df1.index = df1[keys]
df2.index = df2[keys]

fullJoinDf = df1.merge(df2, how="outer", left_index=True, right_index=True, suffixes=["","_r"])

Result:

     A      B         C             D  A_r    B_r       C_r           D_r
0  1.0    one  testThis           NaN  NaN    NaN       NaN           NaN
1  2.0    two  testThat  6.368540e+18  2.0    two  testThis -6.457388e+18
2  3.0  three  testThis -7.490461e+18  3.0  three  testThat -7.490461e+18
3  NaN    NaN       NaN           NaN  4.0   four  testThis  4.344649e+18
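
A variation on the same idea, as a minimal sketch: it leaves df1 and df2 untouched by working on indexed copies, and drops the key index afterwards. The integer values in D are stand-ins for the hash() calls in the question (those change between runs), and the names left, right and full are only illustrative.

```python
import pandas as pd

keys = ["A", "B"]

# Stand-in data: fixed integers replace the hash() values from the question.
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["one", "two", "three"],
                    "C": ["testThis", "testThat", "testThis"], "D": [None, 20, 30]})
df2 = pd.DataFrame({"A": [2, 3, 4], "B": ["two", "three", "four"],
                    "C": ["testThis", "testThat", "testThis"], "D": [21, 30, 40]})

# drop=False keeps A and B as ordinary columns while also using them as the index,
# so the original frames keep their default index.
left = df1.set_index(keys, drop=False)
right = df2.set_index(keys, drop=False)

# Join index on index; every overlapping column (including A and B) gets a suffix.
full = left.merge(right, how="outer", left_index=True, right_index=True,
                  suffixes=("", "_r")).reset_index(drop=True)
print(full)
```

As in the result above, A comes back as float (1.0, 2.0, ...) because the unmatched row introduces NaN into an integer column.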

If you rename the columns used in the merge in one of the DataFrames before merging, it looks like it gives the right answer:

df1.merge(df2.rename({'A': 'A_y', 'B': 'B_y'}, axis =1), left_on=keys, right_on=['A_y', 'B_y'], how='outer')
#output:
    A   B       C_x         D_x             A_y     B_y     C_y         D_y
0   1.0 one     testThis    NaN             NaN     NaN     NaN         NaN
1   2.0 two     testThat    -2.482945e+18   2.0     two     testThis    -1.215774e+18
2   3.0 three   testThis    1.140152e+17    3.0     three   testThat    1.140152e+17
3   NaN NaN     NaN         NaN             4.0     four    testThis    -4.915382e+18
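
Building on the rename approach, here is a minimal sketch of how the insert/update/delete detection from the question might look. It uses merge's indicator=True option, which adds a _merge column with the values left_only, right_only or both. The mapping to actions is an assumption (df1 treated as the existing data, df2 as the incoming data), the action column name is hypothetical, and the integers in D are stand-ins for the hash() values.

```python
import pandas as pd

keys = ["A", "B"]

# Stand-in data: fixed integers replace the hash() values from the question.
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["one", "two", "three"],
                    "C": ["testThis", "testThat", "testThis"], "D": [None, 20, 30]})
df2 = pd.DataFrame({"A": [2, 3, 4], "B": ["two", "three", "four"],
                    "C": ["testThis", "testThat", "testThis"], "D": [21, 30, 40]})

# Rename the key columns on the right side so both copies survive the merge.
right_keys = [f"{k}_r" for k in keys]
right = df2.rename(columns=dict(zip(keys, right_keys)))

full = df1.merge(right, how="outer", left_on=keys, right_on=right_keys,
                 suffixes=("", "_r"), indicator=True)

# Assumed mapping: df1 is the existing data, df2 is the incoming data.
action = {"left_only": "delete", "right_only": "insert", "both": "update"}
full["action"] = full["_merge"].astype(str).map(action)
print(full)
```

Rows marked "update" only say that the key exists on both sides; whether anything actually changed would still need a comparison of C with C_r and D with D_r.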