How to compare two non-identical DataFrames in Python

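I have two DataFrames whose columns are in the same order but have different names, and whose rows differ. The rows of df2 are not the same as the rows of df1.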

df1=     col_id  num  name
      0   1     3     linda
      1   2     4     James

df2=     id     no   name
      0   1     2    granpa
      1   2     6    linda
      2   3     7    sam

This is the output I need. Each row shows the old and new values side by side, so the user can clearly see what changed between the two DataFrames:

 result   col_id        num              name   
      0   1             was 3| now 2    was linda| now granpa  
      1   2             was 4| now 6    was James| now linda
      2   was  | now 3  was  | now 7    was      | now sam

If I understand correctly, you want something like this:

new_df = df1.drop(['name', 'num'], axis=1).merge(df2.rename({'id': 'col_id'}, axis=1), how='outer')

Output:

>>> new_df
   col_id  no    name
0       1   2  granpa
1       2   6   linda
2       3   7     sam

Since your goal is just to compare differences, use DataFrame.compare instead of aggregating into strings.

However,

DataFrame.compare can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames
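
To illustrate the restriction quoted above, here is a minimal sketch that rebuilds the question's two frames (the constructor calls are my own reconstruction of the printed tables) and calls compare on them directly. The flag name `raised` is only for illustration.

```python
import pandas as pd

# Reconstruction of the frames printed in the question.
df1 = pd.DataFrame({"col_id": [1, 2], "num": [3, 4], "name": ["linda", "James"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "no": [2, 6, 7], "name": ["granpa", "linda", "sam"]})

try:
    # Labels and shape differ, so pandas refuses to compare.
    df1.compare(df2)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```

The call fails with a ValueError because neither the row labels nor the column labels match.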

So we just need to align the row/column indexes, either via merge or reindex.

Align via merge

  1. Outer-merge the two dfs:

    merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')
    #    col_id  num  name_x  id  no  name_y
    # 0       1    3   linda   1   2  granpa
    # 1       2    4   James   2   6   linda
    # 2     NaN  NaN     NaN   3   7     sam
    
  2. Split the merged frame into left/right frames and align their columns with set_axis:

    cols = df1.columns
    left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
    #    col_id  num    name
    # 0       1    3   linda
    # 1       2    4   James
    # 2     NaN  NaN     NaN
    
    right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)
    #    col_id  num    name
    # 0       1    2  granpa
    # 1       2    6   linda
    # 2       3    7     sam
    
  3. compare the aligned left/right frames (use keep_equal=True to show equal cells):

    left.compare(right, keep_shape=True, keep_equal=True)
    #        col_id         num          name
    #    self other  self other   self  other
    # 0     1     1     3     2  linda granpa
    # 1     2     2     4     6  James  linda
    # 2   NaN     3   NaN     7    NaN    sam
    
    left.compare(right, keep_shape=True)
    #        col_id         num          name
    #    self other  self other   self  other
    # 0   NaN   NaN     3     2  linda granpa
    # 1   NaN   NaN     4     6  James  linda
    # 2   NaN     3   NaN     7    NaN    sam
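
The three steps above can be put together into one self-contained script. This is a sketch under the assumption that the frames are built as below (my reconstruction of the question's tables); the variable name `diff` is just for illustration.

```python
import pandas as pd

# Reconstruction of the frames printed in the question.
df1 = pd.DataFrame({"col_id": [1, 2], "num": [3, 4], "name": ["linda", "James"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "no": [2, 6, 7], "name": ["granpa", "linda", "sam"]})

# 1. Outer-merge on the key columns so every id from either frame gets a row.
merged = df1.merge(df2, how="outer", left_on="col_id", right_on="id")

# 2. Split the merged frame back into two halves and give both df1's labels.
cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)

# 3. Compare the aligned halves, keeping every row and every column.
diff = left.compare(right, keep_shape=True, keep_equal=True)
print(diff)
```

Without further dtype handling, the numeric columns on the left side print as floats (for example 1.0), because the unmatched row introduces NaN.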
    

Align via reindex

If you are 100% sure that one df is a subset of the other, then reindex the subset's rows.

In your example, df1 is a subset of df2, so reindex df1:

(
    df1.assign(id=df1.col_id)          # copy col_id (we need original col_id after reindexing)
       .set_index('id')                # set index to copied id
       .reindex(df2.id)                # reindex against df2's id
       .reset_index(drop=True)         # remove copied id
       .set_axis(df2.columns, axis=1)  # relabel columns to match df2 (id, no, name)
       .compare(df2, keep_equal=True, keep_shape=True)
)

#            id          no          name
#    self other  self other   self  other
# 0     1     1     3     2  linda granpa
# 1     2     2     4     6  James  linda
# 2   NaN     3   NaN     7    NaN    sam
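
The reindex route relies on the subset assumption, so it may be worth checking before using it. A minimal sketch, assuming the key columns are `col_id` and `id` as in the question; the names `df1_is_subset` and `keys_unique` are hypothetical.

```python
import pandas as pd

# Reconstruction of the frames printed in the question.
df1 = pd.DataFrame({"col_id": [1, 2], "num": [3, 4], "name": ["linda", "James"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "no": [2, 6, 7], "name": ["granpa", "linda", "sam"]})

# True only if every key in df1 also appears in df2.
df1_is_subset = bool(df1["col_id"].isin(df2["id"]).all())

# reindex raises on duplicate index labels, so df1's keys must also be unique.
keys_unique = df1["col_id"].is_unique

print(df1_is_subset, keys_unique)
```

If either check fails, the merge approach is the safer choice, since an outer merge keeps keys from both sides.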

Nullable integers

Normally int cannot be mixed with NaN, so pandas converts the column to float. To keep the int values as int (as in the examples above):
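
One possible way to do this, offered as a sketch rather than the only option, is pandas' nullable `Int64` extension dtype. The small frame below is a stand-in for the left half of the merged result; its name `left` and its values are assumptions for illustration.

```python
import pandas as pd

# A missing value forces the integer columns to float64.
left = pd.DataFrame({"col_id": [1, 2, None], "num": [3, 4, None]})
print(left.dtypes)

# Cast to the nullable integer dtype: whole-number floats become ints,
# and NaN becomes pd.NA.
left_int = left.astype({"col_id": "Int64", "num": "Int64"})
print(left_int)
```

Note the capital "I" in `Int64`: the lowercase `int64` is the ordinary NumPy dtype, which cannot hold missing values.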