Python Pandas dataframes merge update

Question

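My problem is a bit tricky (similar to an SQL merge/update) and I can't figure out how to solve it. (A small sample of the dataframes is given below.)

I have two dataframes: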

  dfOld 

     A   B  C  D  E
     x1  x2  g  h  r
     q1  q2  x  y  s
     t1  t2  h  j  u
     p1  p2  r  s  t

and

 dfNew 

          A   B  C  D   E
          x1  x2  a  b  c
          s1  s2  p  q  r
          t1  t2  h  j  u
          q1  q2  x  y  z

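We want to merge the dataframes according to the following rules (ColA and ColB can be treated as the key):

For any ColA/ColB combination: if ColC/D/E match exactly, take the values from either dataframe; if any value in ColC/D/E has changed, take the values from the new dataframe; if dfNew contains a new ColA/ColB combination, take those values; and if a ColA/ColB combination does not exist in dfNew, take the values from dfOld.

So my output should look like this: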

            A   B  C  D   E
            x1  x2  a  b  c
            q1  q2  x  y  z
            t1  t2  h  j  u
            p1  p2  r  s  t
            s1  s2  p  q  r

I am trying the following (df and df1 stand for the two dataframes above):

    mydfL = (df.merge(df1,indicator = True, how='left').loc[lambda x : x['_merge']!='both'])
    mydfR = (df1.merge(df,indicator = True, how='left').loc[lambda x : x['_merge']!='both'])


    dfO = pd.concat([mydfL,mydfR])

    dfO.drop("_merge", axis=1, inplace=True)

My output looks like this (I kept the index for clarity):

            A   B  C  D  E
        0  x1  x2  a  b  c
        2  s1  s2  p  q  r
        3  q1  q2  x  y  z
        0  x1  x2  g  h  r
        2  q1  q2  x  y  s
        3  p1  p2  r  s  t

However, this output does not serve my purpose. First, it does not include the row that is exactly the same in dfOld and dfNew, which is:

          t1  t2  h  j  u

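Next, it includes all rows for the ColA/ColB keys x1, x2 and q1, q2, whereas I only want the updated ColC/D/E values from the new dataframe (dfNew). It includes data from both.

So could I get some help understanding what I am missing, and what a better, more elegant way to do this might be? Thanks in advance.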

Answer 1

You can use combine_first, with A/B as a temporary index:

    out = (dfNew.set_index(['A', 'B'])
                .combine_first(dfOld.set_index(['A', 'B']))
                .reset_index()
          )
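For anyone who wants to try this end to end, here is a minimal, self-contained sketch that rebuilds the two sample frames from the question and applies the same combine_first call. The `cols` and `keys` variables are only illustrative helpers, not part of the original post.

```python
import pandas as pd

# Sample frames copied from the question.
cols = ["A", "B", "C", "D", "E"]
dfOld = pd.DataFrame(
    [["x1", "x2", "g", "h", "r"],
     ["q1", "q2", "x", "y", "s"],
     ["t1", "t2", "h", "j", "u"],
     ["p1", "p2", "r", "s", "t"]],
    columns=cols,
)
dfNew = pd.DataFrame(
    [["x1", "x2", "a", "b", "c"],
     ["s1", "s2", "p", "q", "r"],
     ["t1", "t2", "h", "j", "u"],
     ["q1", "q2", "x", "y", "z"]],
    columns=cols,
)

# Illustrative helper: the columns treated as the merge key.
keys = ["A", "B"]

# dfNew takes priority; dfOld only supplies keys (or missing values)
# that dfNew does not have.
out = (
    dfNew.set_index(keys)
         .combine_first(dfOld.set_index(keys))
         .reset_index()
)
print(out)
```

Two caveats, as far as I understand the behavior: combine_first aligns on the union of both indexes, so the row order may differ from the order shown in the question, and it falls back to dfOld wherever dfNew has a missing value, not only where the whole row is absent. If dfNew may contain NaN values that should be kept as they are, `pd.concat([dfOld, dfNew]).drop_duplicates(subset=['A', 'B'], keep='last')` is one possible alternative.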
