Add a column stating whether a record occurs across both datasets

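I have 2 dfs that I want to concatenate and then deduplicate. Before deduplicating, I want to add a column that states whether a record also occurs in df_b (the df_b copy is the one the deduplication will remove). If the record does not occur in df_b, meaning it is not a duplicate across the two dfs, the column should stay blank.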

The desired result is df_combined, built from the two dfs below.

df_a

    title             director
0   Toy Story         John Lasseter
1   Goodfellas        Martin Scorsese
2   Meet the Fockers  Jay Roach
3   The Departed      Martin Scorsese

df_b

    title             director
0   Toy Story         John Lass
1   The Hangover      Todd Phillips
2   Rocky             John Avildsen
3   The Departed      Martin Scorsese


df_combine = pd.concat([df_a, df_b], ignore_index=True, sort=False)
df_combined (desired):

    title             director          occurence_both
0   Toy Story         John Lasseter     b
1   Goodfellas        Martin Scorsese
2   Meet the Fockers  Jay Roach
3   The Departed      Martin Scorsese   b
5   The Hangover      Todd Phillips
6   Rocky             John Avildsen

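We can use duplicated with keep=False to mark all duplicates, and np.where to convert the resulting boolean Series to 'b' and ''. Then follow up with drop_duplicates to remove the duplicate rows. Both operations should use subset='title', because the director values do not always match between the two dfs (John Lasseter vs. John Lass), so comparing whole rows would miss Toy Story: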

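import numpy as np
import pandas as pd

df_combine = pd.concat([df_a, df_b], ignore_index=True, sort=False)
# Mark every row whose title appears more than once (keep=False flags all copies)
df_combine['occurence_both'] = np.where(
    df_combine.duplicated(subset='title', keep=False), 'b', ''
)
# Drop the later copies; the default keep='first' retains the df_a row
df_combine = df_combine.drop_duplicates(subset='title')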

df_combine:

              title         director occurence_both
0         Toy Story    John Lasseter              b
1        Goodfellas  Martin Scorsese               
2  Meet the Fockers        Jay Roach               
3      The Departed  Martin Scorsese              b
5      The Hangover    Todd Phillips               
6             Rocky    John Avildsen
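
One caveat: duplicated(subset='title', keep=False) also flags titles that repeat within a single df, so a title listed twice in df_a but never in df_b would still be marked 'b'. If that can happen in your data, a possible variant is to test membership in df_b directly with Series.isin. The sketch below is only one way to do it. It assumes title is the only matching key, and the helper name combine_with_flag is made up for illustration:

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({
    "title": ["Toy Story", "Goodfellas", "Meet the Fockers", "The Departed"],
    "director": ["John Lasseter", "Martin Scorsese", "Jay Roach", "Martin Scorsese"],
})
df_b = pd.DataFrame({
    "title": ["Toy Story", "The Hangover", "Rocky", "The Departed"],
    "director": ["John Lass", "Todd Phillips", "John Avildsen", "Martin Scorsese"],
})


def combine_with_flag(a, b, key="title"):
    """Hypothetical helper: concat a and b, flagging rows of a whose key is in b."""
    combined = pd.concat([a, b], ignore_index=True, sort=False)
    # After concat with ignore_index=True, rows from `a` fill the first len(a) positions.
    from_a = np.arange(len(combined)) < len(a)
    # True where the row's key also exists somewhere in `b`.
    in_b = combined[key].isin(b[key]).to_numpy()
    combined["occurence_both"] = np.where(from_a & in_b, "b", "")
    # keep='first' retains the copy from `a`, which is the one carrying the flag.
    return combined.drop_duplicates(subset=key, keep="first")


result = combine_with_flag(df_a, df_b)
print(result)
```

On the sample data this gives the same rows and flags as the answer above. The difference only shows up when a title repeats inside df_a alone.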