How to concatenate only non-redundant rows in a dataframe based on specific features?

Question

I have a rather tricky problem: I need to concatenate tables 1 and 2. The output should look like Out.

1:
A B C | Y
1 1 5   1 <---- keep
2 2 5   1 <---- keep

2:
A B C | Y
1 1 6   0 <---- drop, because duplicated on subset=[A,B] with row of table 1.
1 2 6   0 <---- keep
3 3 6   0 <---- keep, despite duplicated on subset=[A,B] within this table.
3 3 7   0 <---- keep, despite duplicated on subset=[A,B] within this table.

Out:
A B C | Y
1 1 5   1
1 2 6   0
2 2 5   1
3 3 6   0
3 3 7   0

So, as you can see, I cannot simply drop duplicates on subset=[A,B] after concatenating. That would also remove the rows 3 3 6 0 and 3 3 7 0.
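To make the problem concrete, here is a minimal sketch of the naive approach, assuming the two tables are held in pandas DataFrames named `df1` and `df2` (names chosen here for illustration). With the default `keep='first'`, `drop_duplicates` removes the unwanted `1 1 6 0` row but also `3 3 7 0`; with `keep=False` it drops every row whose `[A, B]` pair repeats, including both `3 3` rows.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 1, 5, 1], [2, 2, 5, 1]], columns=['A', 'B', 'C', 'Y'])
df2 = pd.DataFrame([[1, 1, 6, 0], [1, 2, 6, 0], [3, 3, 6, 0], [3, 3, 7, 0]],
                   columns=['A', 'B', 'C', 'Y'])

# Stack table 1 on top of table 2
stacked = pd.concat([df1, df2], ignore_index=True)

# Default keep='first': only the first row per (A, B) pair survives,
# so 3 3 7 0 is lost along with 1 1 6 0
naive_first = stacked.drop_duplicates(subset=['A', 'B'])

# keep=False: every row whose (A, B) pair occurs more than once is dropped,
# so both 3 3 rows and both 1 1 rows are lost
naive_none = stacked.drop_duplicates(subset=['A', 'B'], keep=False)
```

Neither variant matches Out, which is why the deduplication has to compare table 2 against table 1 only.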

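To sum up: I just want to concatenate 1 and 2. If a row in table 2 has the same A and B values as a row in table 1, I want to keep only the row from table 1. I do not want to drop the other rows in table 2 that are duplicated on A and B.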

Regards

Answer 1

I think something like the following, using a full outer join, should work (you may want to reorder the rows in the output table if needed):

import pandas as pd
import numpy as np

df1 = pd.DataFrame([[1, 1, 5, 1], [2, 2, 5, 1]], columns = ['A','B','C', 'Y'])
df2 = pd.DataFrame([[1, 1, 6, 0], [1, 2, 6, 0], [3, 3, 6, 0], [3, 3, 7, 0]], columns = ['A','B','C', 'Y'])
# Full outer join on the key columns; the overlapping columns C and Y
# come back as C_x/Y_x (from df1) and C_y/Y_y (from df2)
df = pd.merge(df1, df2, on=['A','B'], how='outer')
# Prefer df1's value; fall back to df2's where the key is missing from df1
df['C'] = df['C_x'].fillna(df['C_y']).astype(int)
df['Y'] = df['Y_x'].fillna(df['Y_y']).astype(int)
df = df[['A','B','C','Y']]
df.head()

#   A  B  C  Y
#0  1  1  5  1
#1  2  2  5  1
#2  1  2  6  0
#3  3  3  6  0
#4  3  3  7  0
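One caveat with the outer join: if a key in df1 matched several rows in df2, the join would repeat the df1 row once per match. A possible alternative, sketched here under the assumption that the inputs look like the example data, is an anti-join: mark each df2 row with the `indicator` option of `merge`, keep only the rows whose key is absent from df1, and then concatenate. The names `keys`, `marked`, `df2_only` and `out` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 1, 5, 1], [2, 2, 5, 1]], columns=['A', 'B', 'C', 'Y'])
df2 = pd.DataFrame([[1, 1, 6, 0], [1, 2, 6, 0], [3, 3, 6, 0], [3, 3, 7, 0]],
                   columns=['A', 'B', 'C', 'Y'])

# Distinct (A, B) pairs that occur in df1
keys = df1[['A', 'B']].drop_duplicates()

# Left join df2 against those pairs; the _merge column records whether
# each df2 row found a match ('both') or not ('left_only')
marked = df2.merge(keys, on=['A', 'B'], how='left', indicator=True)

# Keep only the df2 rows whose (A, B) pair does not occur in df1
df2_only = marked.loc[marked['_merge'] == 'left_only', df2.columns]

# Stack df1 on top of the surviving df2 rows and sort for readability
out = (pd.concat([df1, df2_only])
         .sort_values(['A', 'B', 'C'])
         .reset_index(drop=True))
```

Because `keys` holds each pair only once, the left join cannot multiply rows of df2, and the duplicated `3 3` rows inside df2 are left untouched.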

Answer 2

I have a solution similar to Sandipan's, but I used an inner join instead.

import pandas as pd
df1 = pd.DataFrame([[1, 1, 5, 1], [2, 2, 5, 1]], columns = ['A','B','C', 'Y'])
df2 = pd.DataFrame([[1, 1, 6, 0], [1, 2, 6, 0], [3, 3, 6, 0], [3, 3, 7, 0]], columns = ['A','B','C', 'Y'])

# Add an index for df2
df2['idx'] = range(len(df2))

# Find the index of common rows by inner join
common_row = pd.merge(df1, df2, on=['A','B'], how='inner').idx.tolist()

# Remove common rows in df2
df2 = df2[~df2.idx.isin(common_row)]
df2 = df2.iloc[:,0:-1]

# Concat df1 and df2
df = pd.concat([df1, df2])
df = df.sort_values(by=['A','B'], ascending=[True, True])
df
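The same filtering can probably be done without the helper idx column by comparing the key pairs directly. The sketch below wraps that idea in a function; the name `concat_non_redundant` and its parameters are hypothetical, not part of pandas, and it assumes the key columns exist in both frames.

```python
import pandas as pd

def concat_non_redundant(first, second, key_cols):
    """Append to `first` the rows of `second` whose key is not in `first`.

    Hypothetical helper for illustration only.
    """
    # Build tuple-like keys from the chosen columns of each frame
    first_keys = pd.MultiIndex.from_frame(first[key_cols])
    second_keys = pd.MultiIndex.from_frame(second[key_cols])
    # Boolean array: True where a row of `second` repeats a key from `first`
    redundant = second_keys.isin(first_keys)
    return pd.concat([first, second[~redundant]], ignore_index=True)

df1 = pd.DataFrame([[1, 1, 5, 1], [2, 2, 5, 1]], columns=['A', 'B', 'C', 'Y'])
df2 = pd.DataFrame([[1, 1, 6, 0], [1, 2, 6, 0], [3, 3, 6, 0], [3, 3, 7, 0]],
                   columns=['A', 'B', 'C', 'Y'])

result = concat_non_redundant(df1, df2, ['A', 'B'])
```

Here `result` keeps the rows of df1 first, followed by the surviving rows of df2 in their original order; sort by A and B afterwards if the order of Out is required.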
