Adding a value only to the first key combination during a merge
I have two DataFrames:
df_1
date id value
2021-01-01 A1 100
2021-01-01 A1 200
2021-01-01 A1 300
2021-01-02 A1 100
2021-01-02 A1 200
2021-01-03 A1 500
2021-01-03 A1 800
df_2
date id value_to_add
2021-01-01 A1 150
2021-01-03 A1 350
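I am trying to keep the structure of df_1 and, during the merge, add value_to_add only at the first occurrence of each key, filling NaN and every row other than the first with 0: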
date id value value_to_add
2021-01-01 A1 100 150
2021-01-01 A1 200 0 # 0 because the 150 has already been added
2021-01-01 A1 300 0
2021-01-02 A1 100 0 # 0 because value_to_add does not exist
2021-01-02 A1 200 0
2021-01-03 A1 500 350
2021-01-03 A1 800 0 # 0 because the 350 has already been added
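My first thought was to drop duplicates on the ['date', 'id'] subset and then merge df_2 into the result, but I am not sure how to get back to the original structure of df_1.
So the question is: is it possible to merge only on the first occurrence of a key during a pd.merge operation? I could not find anything on this topic and, frankly, I am not sure how to achieve it.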
You can filter out the duplicated rows with DataFrame.duplicated and an inverted mask, and use Index.union so that the new column added by merge is not dropped:
df_1.loc[~df_1.duplicated(['date', 'id']),
df_1.columns.union(df_2.columns)] = df_1.merge(df_2, how='left')
df_1 = df_1.fillna(0)
print (df_1)
date id value value_to_add
0 2021-01-01 A1 100 150.0
1 2021-01-01 A1 200 0.0
2 2021-01-01 A1 300 0.0
3 2021-01-02 A1 100 0.0
4 2021-01-02 A1 200 0.0
5 2021-01-03 A1 500 350.0
6 2021-01-03 A1 800 0.0
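A variation on the same idea, as a minimal self-contained sketch: merge first, then keep value_to_add only on the first row of each (date, id) pair with Series.where. The names keys, merged and is_first are illustrative and do not come from the answer above, and the sketch assumes df_2 has at most one row per key.

```python
import pandas as pd

# Sample data from the question.
df_1 = pd.DataFrame({
    "date": ["2021-01-01"] * 3 + ["2021-01-02"] * 2 + ["2021-01-03"] * 2,
    "id": ["A1"] * 7,
    "value": [100, 200, 300, 100, 200, 500, 800],
})
df_2 = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-03"],
    "id": ["A1", "A1"],
    "value_to_add": [150, 350],
})

keys = ["date", "id"]  # illustrative name

# A left merge keeps every row of df_1 in its original order
# (assuming df_2 has at most one row per key).
merged = df_1.merge(df_2, on=keys, how="left")

# True only on the first row of each (date, id) pair.
is_first = ~merged.duplicated(keys)

# Keep the merged value on first rows, 0 elsewhere; then fill the
# NaN left on first rows whose key is missing from df_2.
merged["value_to_add"] = merged["value_to_add"].where(is_first, 0).fillna(0)
print(merged)
```

Because the mask is applied after the merge, the assignment does not rely on index alignment between a filtered frame and the merge result.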
Another idea, using a helper counter column:
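# g numbers the rows within each (date, id) group; df_2 gets g=0 so it only matches the first row
df_1 = df_1.assign(g=df_1.groupby(['date', 'id']).cumcount()).merge(df_2.assign(g=0), how='left')
# drop('g', 1) relied on a positional axis argument, which pandas 2.0 no longer accepts
df_1 = df_1.drop(columns='g').fillna(0)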
print (df_1)
date id value value_to_add
0 2021-01-01 A1 100 150.0
1 2021-01-01 A1 200 0.0
2 2021-01-01 A1 300 0.0
3 2021-01-02 A1 100 0.0
4 2021-01-02 A1 200 0.0
5 2021-01-03 A1 500 350.0
6 2021-01-03 A1 800 0.0
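The counter-column idea can be written out as a runnable sketch on the question's data. The intermediate names left, right and result are illustrative, and the merge keys are passed explicitly here rather than inferred from the shared columns.

```python
import pandas as pd

# Sample data from the question.
df_1 = pd.DataFrame({
    "date": ["2021-01-01"] * 3 + ["2021-01-02"] * 2 + ["2021-01-03"] * 2,
    "id": ["A1"] * 7,
    "value": [100, 200, 300, 100, 200, 500, 800],
})
df_2 = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-03"],
    "id": ["A1", "A1"],
    "value_to_add": [150, 350],
})

keys = ["date", "id"]

# g is 0 for the first row of each key, 1 for the second, and so on.
left = df_1.assign(g=df_1.groupby(keys).cumcount())
# Every df_2 row gets g=0, so it can only match the first row of its key.
right = df_2.assign(g=0)

result = (
    left.merge(right, on=keys + ["g"], how="left")
        .drop(columns="g")
        .fillna({"value_to_add": 0})  # only fill the added column
)
print(result)
```

Filling only value_to_add, rather than the whole frame, leaves any NaN that already existed in other df_1 columns untouched.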
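import numpy as np
s = df_1.set_index(['date', 'id']).join(df_2.set_index(['date', 'id']))
# mark first occurrences by key (the index), not by value, so two keys sharing one value_to_add both keep it
s = s.assign(value_to_add=np.where(~s.index.duplicated(keep='first'), s['value_to_add'], np.nan)).fillna(0)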
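One point worth checking with the join approach is whether first occurrences are detected by value or by key. The sketch below uses hypothetical data (not from the question) in which two different dates carry the same value_to_add; the names by_value and by_key are illustrative.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-03", "2021-01-03"],
    "id": ["A1"] * 4,
    "value": [100, 200, 500, 800],
})
# Hypothetical: both dates have the same amount to add.
df_2 = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-03"],
    "id": ["A1", "A1"],
    "value_to_add": [150, 150],
})

keys = ["date", "id"]
s = df_1.set_index(keys).join(df_2.set_index(keys))

# Duplicates judged on the value column: the second date's 150 is
# treated as a repeat of the first date's 150 and gets zeroed.
by_value = np.where(~s["value_to_add"].duplicated(keep="first"), s["value_to_add"], 0)

# Duplicates judged on the (date, id) index: each key keeps its first row.
by_key = np.where(~s.index.duplicated(keep="first"), s["value_to_add"], 0)

print(by_value.tolist())
print(by_key.tolist())
```

With the question's data the two masks give the same result, since 150 and 350 differ; they diverge only when amounts repeat across keys.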