pandas true outer join?
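How do I get a true outer join in pandas? By that I mean one that actually gives you the whole output instead of combining the columns being merged on. Combining them seems a bit silly to me, because it makes it hard to determine which operation to perform on a given row. I do this all the time to detect whether I should insert, update or delete data, but I always have to create extra copies of the merge columns, which is just a pile of overhead for some datasets (sometimes a lot of it).
Example: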
import pandas as pd
keys = ["A","B"]
df1 = pd.DataFrame({"A":[1,2,3],"B":["one","two","three"],"C":["testThis","testThat", "testThis"],"D":[None,hash("B"),hash("C")]})
df2 = pd.DataFrame({"A":[2,3,4],"B":["two","three","four"],"C":["testThis","testThat", "testThis"], "D":[hash("G"),hash("C"),hash("D")]})
fullJoinDf = df1.merge(df2, how="outer", left_on=keys, right_on=keys, suffixes=["","_r"])
display(
fullJoinDf,
)
A B C D C_r D_r
0 1 one testThis NaN NaN NaN
1 2 two testThat -3.656526e+18 testThis -9.136326e+18
2 3 three testThis -8.571400e+18 testThat -8.571400e+18
3 4 four NaN NaN testThis -4.190116e+17
Notice how it magically combines A and B into a single set of columns. What I want is what I get from SQL outer joins and the like, for example:
A B C D A_r B_r C_r D_r
0 1 one testThis NaN NaN NaN NaN NaN
1 2 two testThat -3.656526e+18 2 two testThis -9.136326e+18
2 3 three testThis -8.571400e+18 3 three testThat -8.571400e+18
3 NaN NaN NaN NaN 4 four testThis -4.190116e+17
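For the stated goal of deciding between insert, update and delete, one possible sketch uses the `indicator=True` option of `merge`, which adds a `_merge` column saying which side each row came from. The mapping of `left_only` to "delete" and `right_only` to "insert" is an assumption (it treats df1 as the stored data and df2 as the incoming data), the `action` column is a name made up for illustration, and fixed integers stand in for the `hash(...)` values so the result is reproducible.

```python
import pandas as pd

keys = ["A", "B"]
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["one", "two", "three"],
                    "C": ["testThis", "testThat", "testThis"], "D": [10, 11, 22]})
df2 = pd.DataFrame({"A": [2, 3, 4], "B": ["two", "three", "four"],
                    "C": ["testThis", "testThat", "testThis"], "D": [33, 22, 44]})

# indicator=True adds a "_merge" column: "left_only", "right_only" or "both".
merged = df1.merge(df2, on=keys, how="outer", suffixes=("", "_r"), indicator=True)

# A matched row counts as changed if any non-key column differs between the sides.
value_cols = [c for c in df1.columns if c not in keys]
changed = pd.Series(False, index=merged.index)
for col in value_cols:
    left, right = merged[col], merged[col + "_r"]
    same = (left == right) | (left.isna() & right.isna())
    changed |= ~same

# Assumed mapping: df1 is the stored data, df2 is the incoming data.
merged["action"] = "unchanged"
merged.loc[merged["_merge"] == "left_only", "action"] = "delete"
merged.loc[merged["_merge"] == "right_only", "action"] = "insert"
merged.loc[(merged["_merge"] == "both") & changed, "action"] = "update"

print(merged[keys + ["_merge", "action"]])
```

This does not produce the separate A_r and B_r columns asked for, but it may remove the need for them when the only purpose is to tell which side a row exists on.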
Edit for @Felipe Whitaker
Using concat:
df3 = df1.copy().set_index(keys)
df4 = df2.copy().set_index(keys)
t = pd.concat([df3,df4], axis=1)
t.reset_index()
A B C D C D
0 1 one testThis NaN NaN NaN
1 2 two testThat -3.656526e+18 testThis -9.136326e+18
2 3 three testThis -8.571400e+18 testThat -8.571400e+18
3 4 four NaN NaN testThis -4.190116e+17
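Edit: more examples
Given the answers, I'm posting more tests so that anyone who stumbles on this question can see more of the "gotcha" variations I found while doing this.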
import pandas as pd
keys = ["A","B"]
df1 = pd.DataFrame({"A":[1,2,3],"B":["one","two","three"],"C":["testThis","testThat", "testThis"],"D":[None,hash("B"),hash("C")]})
df2 = pd.DataFrame({"A":[2,3,4],"B":["two","three","four"],"C":["testThis","testThat", "testThis"], "D":[hash("G"),hash("C"),hash("D")]})
df3 = df1.copy()
df4 = df2.copy()
df3.index = df3[keys]
df4.index = df4[keys]
df5 = df1.copy().set_index(keys)
df6 = df2.copy().set_index(keys)
fullJoinDf = df5.merge(df6, how="outer", left_on=keys, right_on=keys, suffixes=["","_r"])
fullJoinDf_2 = df3.merge(df4, how="outer", left_index=True, right_index=True, suffixes=["","_r"])
t = pd.concat([df1,df2], axis=1, keys=["A","B"])
display(
df3.index,
df5.index,
fullJoinDf,
fullJoinDf_2,
t,
)
Index([(1, 'one'), (2, 'two'), (3, 'three')], dtype='object')
MultiIndex([(1, 'one'),
(2, 'two'),
(3, 'three')],
names=['A', 'B'])
A B C D C_r D_r
0 1 one testThis NaN NaN NaN
1 2 two testThat -3.656526e+18 testThis -9.136326e+18
2 3 three testThis -8.571400e+18 testThat -8.571400e+18
3 4 four NaN NaN testThis -4.190116e+17
A B C D A_r B_r C_r D_r
(1, one) 1.0 one testThis NaN NaN NaN NaN NaN
(2, two) 2.0 two testThat -3.656526e+18 2.0 two testThis -9.136326e+18
(3, three) 3.0 three testThis -8.571400e+18 3.0 three testThat -8.571400e+18
(4, four) NaN NaN NaN NaN 4.0 four testThis -4.190116e+17
A B C D A B C D
0 1 one testThis NaN 2 two testThis -9136325526401183790
1 2 two testThat -3.656526e+18 3 three testThat -8571400026927442160
2 3 three testThis -8.571400e+18 4 four testThis -419011572131270498
If you don't care about the original index at all:
df1.index = df1[keys]
df2.index = df2[keys]
fullJoinDf = df1.merge(df2, how="outer", left_index=True, right_index=True, suffixes=["","_r"])
Result:
A B C D A_r B_r C_r D_r
0 1.0 one testThis NaN NaN NaN NaN NaN
1 2.0 two testThat 6.368540e+18 2.0 two testThis -6.457388e+18
2 3.0 three testThis -7.490461e+18 3.0 three testThat -7.490461e+18
3 NaN NaN NaN NaN 4.0 four testThis 4.344649e+18
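The same idea can be sketched without overwriting the index of df1 and df2 in place. The helper name `keyed` is hypothetical, and fixed integers replace the `hash(...)` values (string hashes vary between Python processes, which is why the D values differ from one output to the next). The index is built from raw arrays so its levels are unnamed and cannot be confused with the A and B columns.

```python
import pandas as pd

keys = ["A", "B"]
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["one", "two", "three"],
                    "C": ["testThis", "testThat", "testThis"], "D": [10, 11, 22]})
df2 = pd.DataFrame({"A": [2, 3, 4], "B": ["two", "three", "four"],
                    "C": ["testThis", "testThat", "testThis"], "D": [33, 22, 44]})

def keyed(frame, keys):
    """Return a new frame indexed by its key columns; the columns themselves stay."""
    index = pd.MultiIndex.from_arrays([frame[k].to_numpy() for k in keys])
    return frame.set_axis(index, axis=0)

joined = (
    keyed(df1, keys)
    .merge(keyed(df2, keys), how="outer",
           left_index=True, right_index=True, suffixes=("", "_r"))
    .reset_index(drop=True)  # discard the temporary key index
)
print(joined)
```

Because `keyed` returns a new object, df1 and df2 keep their original 0..n-1 index after the join.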
If you rename the merge columns in one of the DataFrames before calling merge, it seems to give the right answer:
df1.merge(df2.rename({'A': 'A_y', 'B': 'B_y'}, axis =1), left_on=keys, right_on=['A_y', 'B_y'], how='outer')
#output:
A B C_x D_x A_y B_y C_y D_y
0 1.0 one testThis NaN NaN NaN NaN NaN
1 2.0 two testThat -2.482945e+18 2.0 two testThis -1.215774e+18
2 3.0 three testThis 1.140152e+17 3.0 three testThat 1.140152e+17
3 NaN NaN NaN NaN 4.0 four testThis -4.915382e+18
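The rename approach can be wrapped in a small helper so the key names are not hard-coded. This is only a sketch: `full_outer_join` and its `suffix` parameter are hypothetical names, fixed integers replace the `hash(...)` values, and it assumes no left-hand column already ends in the chosen suffix.

```python
import pandas as pd

def full_outer_join(left, right, keys, suffix="_r"):
    """Outer-join two frames and keep the key columns of both sides."""
    # Suffix every right-hand column so nothing collides with the left frame
    # and the right keys get names that differ from the left keys.
    right_renamed = right.add_suffix(suffix)
    return left.merge(
        right_renamed,
        how="outer",
        left_on=keys,
        right_on=[k + suffix for k in keys],
    )

keys = ["A", "B"]
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["one", "two", "three"],
                    "C": ["testThis", "testThat", "testThis"], "D": [10, 11, 22]})
df2 = pd.DataFrame({"A": [2, 3, 4], "B": ["two", "three", "four"],
                    "C": ["testThis", "testThat", "testThis"], "D": [33, 22, 44]})

joined = full_outer_join(df1, df2, keys)
print(joined)
```

Since the left and right key names differ, merge keeps both sets of key columns, so a missing A or A_r shows which side a row is absent from. Note that add_suffix creates a renamed copy of the right frame, so this does not avoid the extra overhead mentioned in the question.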