How to do a full outer join excluding the intersection between two pandas dataframes?

Question

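I have two datasets with the same column headers. I want to drop every row that is 100% identical in both and keep only the rows that differ. How can I do that?

Thanks for your time!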

Answer 1

To get everything except the intersection of two pandas dataframes, try the following:

# Everything from the first except what is in the second
# (cells equal to df2 at the same index/column position become NaN)
r1 = df1[~df1.isin(df2)]

# Everything from the second except what is in the first
r2 = df2[~df2.isin(df1)]

# Concatenate, then drop every row that contains a NaN
result = pd.concat(
    [r1, r2]
).dropna().reset_index(drop=True)

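One caveat: filtering with a boolean mask can turn your int values into floats. By default, pandas replaces the masked-out (False) cells with NaN, which is a float value, so the whole column is converted to float. You can see this happen in the example below.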

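To avoid this, declare the data type explicitly when creating the dataframes, for example with the nullable dtype='Int64' shown commented out in the example.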
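As a minimal sketch of that advice, assuming the same values as the example's CSV files but built inline (so no files are needed), the nullable Int64 dtype lets masked cells become <NA> without forcing the columns to float:

```python
import pandas as pd

# Inline stand-ins for csv1 and csv2 (an assumption, so the sketch needs no files).
df1 = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]}, dtype="Int64")
df2 = pd.DataFrame({"A": [1, 4, 10], "B": [2, 5, 11], "C": [3, 6, 12]}, dtype="Int64")

# Cells that are equal at the same position are masked to <NA>.
# Int64 can hold <NA>, so the columns are not converted to float.
r1 = df1[~df1.isin(df2)]
r2 = df2[~df2.isin(df1)]

# Concatenate, then drop every row that contains a missing value.
result = pd.concat([r1, r2]).dropna().reset_index(drop=True)

print(result)
print(result.dtypes)
```

With read_csv, the equivalent would be passing dtype="Int64" to each call, as hinted by the commented-out argument in the example.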

Example

import pandas as pd

df1 = pd.read_csv("./csv1.csv")  # add dtype="Int64" to keep integer columns
print(f"csv1\n{df1}\n")

df2 = pd.read_csv("./csv2.csv")  # add dtype="Int64" to keep integer columns
print(f"csv2\n{df2}\n")

# Everything from the first except what is in the second
# (cells equal to df2 at the same index/column position become NaN)
r1 = df1[~df1.isin(df2)]
# Everything from the second except what is in the first
r2 = df2[~df2.isin(df1)]

# Concatenate, then drop every row that contains a NaN
result = pd.concat(
    [r1, r2]
).dropna().reset_index(drop=True)

print(f"result\n{result}\n")

Input

csv1
   A   B   C
0  1   2   3
1  4   5   6
2  7   8   9

csv2
    A   B   C
0   1   2   3
1   4   5   6
2  10  11  12

Output

result
      A     B     C
0   7.0   8.0   9.0
1  10.0  11.0  12.0
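
Note that isin with a DataFrame argument compares cells by index and column label, so the approach above only detects rows that match at the same position, and dropna also discards rows that match only partially. If rows should instead be compared by value regardless of position, one possible alternative (a sketch, not part of the original answer) is an outer merge with indicator=True, keeping only the rows that were not found in both frames. The inline dataframes are assumed stand-ins for the CSV files:

```python
import pandas as pd

# Inline stand-ins for csv1 and csv2 (an assumption, so the sketch needs no files).
df1 = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]})
df2 = pd.DataFrame({"A": [1, 4, 10], "B": [2, 5, 11], "C": [3, 6, 12]})

# With no "on" argument, merge joins on all shared columns (A, B, C),
# so a row counts as matching only if every value is identical.
# indicator=True adds a "_merge" column: left_only, right_only, or both.
merged = df1.merge(df2, how="outer", indicator=True)

# Keep the rows that appear in only one of the two frames.
result = (
    merged[merged["_merge"] != "both"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)

print(result)
print(result.dtypes)
```

Because no cells are masked with NaN here, the integer columns in this example stay integers.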


