How to compare 2 non-identical dataframes in Python

Question

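I have two dataframes whose columns are in the same order but whose column names and rows differ. The rows of df2 are different from the rows of df1.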

df1=     col_id  num  name
      0   1     3     linda
      1   2     4     James

df2=     id     no   name
      0   1     2    granpa
      1   2     6    linda
      2   3     7    sam

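This is the output I need. It shows each row with both its old and its new values, so that the user can clearly see what changed between the two dataframes: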

 result   col_id        num              name   
      0   1             was 3| now 2    was linda| now granpa  
      1   2             was 4| now 6    was James| now linda
      2   was  | now 3  was  | now 7    was      | now sam

Answer 1

If I understand correctly, you want something like this:

new_df = df1.drop(['name', 'num'], axis=1).merge(df2.rename({'id': 'col_id'}, axis=1), how='outer')

Output:

>>> new_df
   col_id  no    name
0       1   2  granpa
1       2   6   linda
2       3   7     sam
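
For anyone who wants to run the snippet above, here is a minimal self-contained sketch. The frame construction is an assumption reconstructed from the tables in the question. Note that the result holds only df2's values under df1's key name, so the old values from df1 do not appear in it.

```python
import pandas as pd

# Sample frames reconstructed from the question (assumed construction).
df1 = pd.DataFrame({"col_id": [1, 2], "num": [3, 4], "name": ["linda", "James"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "no": [2, 6, 7], "name": ["granpa", "linda", "sam"]})

# Keep only df1's key column, rename df2's key to match, then outer-merge
# on the shared column name (col_id).
new_df = df1.drop(["name", "num"], axis=1).merge(
    df2.rename({"id": "col_id"}, axis=1), how="outer"
)
print(new_df)
```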

Answer 2

Since your goal is just to compare differences, use DataFrame.compare instead of aggregating the values into strings.

However,

DataFrame.compare can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames

So we just need to align the row/column indexes, either via merge or via reindex.
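
To see why the alignment step is needed, the following sketch calls compare on the two unaligned frames. The frame construction is assumed from the question, and the exact error text is not shown because it varies between pandas versions. DataFrame.compare requires pandas 1.1 or later.

```python
import pandas as pd

# Sample frames reconstructed from the question (assumed construction).
df1 = pd.DataFrame({"col_id": [1, 2], "num": [3, 4], "name": ["linda", "James"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "no": [2, 6, 7], "name": ["granpa", "linda", "sam"]})

# The frames differ in shape and in column labels, so compare refuses them.
raised = False
try:
    df1.compare(df2)
except ValueError as exc:
    raised = True
    print("compare failed:", exc)
```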

Align via `merge`

Outer-merge the two dfs:

merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')
#    col_id  num  name_x  id  no  name_y
# 0       1    3   linda   1   2  granpa
# 1       2    4   james   2   6   linda
# 2     NaN  NaN     NaN   3   7     sam

Split the merged frame into left/right frames and align their columns with set_axis:

cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
#    col_id  num    name
# 0       1    3   linda
# 1       2    4   james
# 2     NaN  NaN     NaN

right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)
#    col_id  num    name
# 0       1    2  granpa
# 1       2    6   linda
# 2       3    7     sam

compare the aligned left/right frames (use keep_equal=True to show equal cells):

left.compare(right, keep_shape=True, keep_equal=True)
#        col_id         num          name
#    self other  self other   self  other
# 0     1     1     3     2  linda granpa
# 1     2     2     4     6  james  linda
# 2   NaN     3   NaN     7    NaN    sam

left.compare(right, keep_shape=True)
#        col_id         num          name
#    self other  self other   self  other
# 0   NaN   NaN     3     2  linda granpa
# 1   NaN   NaN     4     6  james  linda
# 2   NaN     3   NaN     7    NaN    sam
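
The merge steps above can be put together into one runnable sketch. The frame construction is assumed from the question, and `diff` is just an illustrative variable name. The numeric columns of `left` come out as float here because of the NaN row, which the commented output above does not show.

```python
import pandas as pd

# Sample frames reconstructed from the question (assumed construction).
df1 = pd.DataFrame({"col_id": [1, 2], "num": [3, 4], "name": ["linda", "James"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "no": [2, 6, 7], "name": ["granpa", "linda", "sam"]})

# 1. Outer-merge on the differently named key columns.
merged = df1.merge(df2, how="outer", left_on="col_id", right_on="id")

# 2. Split by position and give both halves df1's column labels.
cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)

# 3. Both halves now share row and column labels, so compare accepts them.
diff = left.compare(right, keep_shape=True, keep_equal=True)
print(diff)
```

The positional split relies on the two frames having the same number of columns in the same order, as stated in the question.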

Align via `reindex`

If you are 100% sure that one df is a subset of the other, then reindex the subset's rows.

In your example, df1's ids are a subset of df2's, so reindex df1:

(
    df1.assign(id=df1.col_id)          # copy col_id (we need original col_id after reindexing)
       .set_index('id')                # set index to copied id
       .reindex(df2.id)                # reindex against df2's id
       .reset_index(drop=True)         # remove copied id
       .set_axis(df2.columns, axis=1)  # align column names
       .compare(df2, keep_equal=True, keep_shape=True)
)

#            id          no          name
#    self other  self other   self  other
# 0     1     1     3     2  linda granpa
# 1     2     2     4     6  james  linda
# 2   NaN     3   NaN     7    NaN    sam
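
Because the reindex approach depends on the subset assumption, it may be worth checking that assumption first. The sketch below does so with `isin`. The frame construction is assumed from the question, and `is_subset`, `aligned`, and `diff` are illustrative names.

```python
import pandas as pd

# Sample frames reconstructed from the question (assumed construction).
df1 = pd.DataFrame({"col_id": [1, 2], "num": [3, 4], "name": ["linda", "James"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "no": [2, 6, 7], "name": ["granpa", "linda", "sam"]})

# Every key in df1 must also appear in df2, otherwise reindexing drops rows.
is_subset = df1["col_id"].isin(df2["id"]).all()
if not is_subset:
    raise ValueError("df1 has ids that are missing from df2; use the merge approach")

aligned = (
    df1.assign(id=df1["col_id"])       # keep a copy of the key as a regular column
       .set_index("id")                # index by the copied key
       .reindex(df2["id"])             # one row per df2 id, NaN where df1 has none
       .reset_index(drop=True)         # back to the same 0..n-1 index as df2
       .set_axis(df2.columns, axis=1)  # use df2's column labels
)
diff = aligned.compare(df2, keep_equal=True, keep_shape=True)
print(diff)
```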

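Nullable integers

Normally int cannot be mixed with NaN, so pandas converts the column to float. To keep the int values as int (as in the examples above):

Ideally, we would use astype('Int64') (capital I) to convert the int columns to nullable integers.
However, there is currently a comparison bug with Int64, so use astype(object) for now.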
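
A small sketch of the three dtype options, using df1's num column reindexed against df2's ids. The frame construction is assumed from the question and the variable names are illustrative. It only shows how each dtype behaves when a missing row is introduced and does not exercise the Int64 comparison bug mentioned above.

```python
import pandas as pd

# Sample frames reconstructed from the question (assumed construction).
df1 = pd.DataFrame({"col_id": [1, 2], "num": [3, 4], "name": ["linda", "James"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "no": [2, 6, 7], "name": ["granpa", "linda", "sam"]})

num = df1.set_index("col_id")["num"]

# Default int64: the missing row forces a conversion to float64.
as_float = num.reindex(df2["id"])

# Nullable integer: values stay integers, the missing row becomes <NA>.
as_nullable = num.astype("Int64").reindex(df2["id"])

# Object dtype: values stay Python ints, the missing row becomes NaN.
as_object = num.astype(object).reindex(df2["id"])

print(as_float.dtype, as_nullable.dtype, as_object.dtype)
```

The conversion has to happen before the merge or reindex step, since converting a column that has already become float would not restore the integers under astype(object).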
