Add a column stating whether a record occurred across datasets
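I have 2 dfs that I want to concat and drop duplicates on. Before deduplicating, I want to add a column stating whether a record from df_b (which will be dropped by the deduplication) occurred across both dfs. Otherwise the column should stay blank, meaning the record did not occur in df_b (it is not a duplicate across dfs).
Desired result: df_combined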
df_a
              title         director
0         Toy Story    John Lasseter
1        Goodfellas  Martin Scorsese
2  Meet the Fockers        Jay Roach
3      The Departed  Martin Scorsese
df_b
          title         director
0     Toy Story        John Lass
1  The Hangover    Todd Phillips
2         Rocky    John Avildsen
3  The Departed  Martin Scorsese
df_combine = pd.concat([df_a, df_b], ignore_index=True, sort=False)
df_combined
              title         director occurence_both
0         Toy Story    John Lasseter              b
1        Goodfellas  Martin Scorsese
2  Meet the Fockers        Jay Roach
3      The Departed  Martin Scorsese              b
5      The Hangover    Todd Phillips
6             Rocky    John Avildsen
We can use duplicated with keep=False to mark all duplicates and np.where to convert the boolean Series to 'b' and ''. Then follow up with drop_duplicates to remove the duplicate rows. Both operations should use subset so that they only consider the title column:
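import numpy as np
import pandas as pd

df_combine = pd.concat([df_a, df_b], ignore_index=True, sort=False)
# Mark Duplicates: keep=False flags every row whose title appears more than once
df_combine['occurence_both'] = np.where(
    df_combine.duplicated(subset='title', keep=False), 'b', ''
)
# Drop Duplicates: the default keep='first' retains the df_a row
df_combine = df_combine.drop_duplicates(subset='title')
df_combine: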
              title         director occurence_both
0         Toy Story    John Lasseter              b
1        Goodfellas  Martin Scorsese
2  Meet the Fockers        Jay Roach
3      The Departed  Martin Scorsese              b
5      The Hangover    Todd Phillips
6             Rocky    John Avildsen
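One caveat with the approach above: duplicated(keep=False) also flags a title that repeats inside a single df, not only titles shared by both. If that matters, a variant that tests membership in each frame with isin may be closer to the intent. The sketch below is only an illustration under that assumption; the helper name combine_and_flag and its key and flag parameters are hypothetical, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the question.
df_a = pd.DataFrame({
    "title": ["Toy Story", "Goodfellas", "Meet the Fockers", "The Departed"],
    "director": ["John Lasseter", "Martin Scorsese", "Jay Roach", "Martin Scorsese"],
})
df_b = pd.DataFrame({
    "title": ["Toy Story", "The Hangover", "Rocky", "The Departed"],
    "director": ["John Lass", "Todd Phillips", "John Avildsen", "Martin Scorsese"],
})


def combine_and_flag(a, b, key="title", flag="b"):
    """Hypothetical helper: concat, flag keys present in BOTH frames, dedupe."""
    combined = pd.concat([a, b], ignore_index=True, sort=False)
    # A row is flagged only if its key appears in both a and b,
    # so repeats within a single frame are not marked.
    in_both = combined[key].isin(a[key]) & combined[key].isin(b[key])
    combined["occurence_both"] = np.where(in_both, flag, "")
    # keep='first' (the default) retains the row that came from `a`.
    return combined.drop_duplicates(subset=key)


result = combine_and_flag(df_a, df_b)
print(result)
```

On the sample data this should give the same rows and flags as the duplicated-based version, since neither frame repeats a title internally.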