How to compare 2 non-identical dataframes in python
I have two dataframes with the same column order but different column names and different rows. The rows in df2 differ from the rows in df1.
df1=
   col_id  num   name
0       1    3  linda
1       2    4  James
df2=
   id  no    name
0   1   2  granpa
1   2   6   linda
2   3   7     sam
This is the output I need. Each row shows both the old and the new values, so the user can clearly see what changed between the two dataframes:
result=
   col_id         num            name
0  1              was 3| now 2   was linda| now granpa
1  2              was 4| now 6   was James| now linda
2  was | now 3    was | now 7    was | now sam
If I understand correctly, you want something like this:
new_df = df1.drop(['name', 'num'], axis=1).merge(df2.rename({'id': 'col_id'}, axis=1), how='outer')
Output:
>>> new_df
col_id no name
0 1 2 granpa
1 2 6 linda
2 3 7 sam
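Since your goal is just to compare differences, use DataFrame.compare rather than aggregating into strings.
However, DataFrame.compare can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames, so the row/column indexes have to be aligned first.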
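To make the labeling restriction concrete, here is a minimal sketch that rebuilds the two frames from the question and calls compare on them directly. The variable name error_message is only illustrative, and the exact wording of the exception text may vary between pandas versions.

```python
import pandas as pd

# The two frames from the question: same column order, different labels and row counts.
df1 = pd.DataFrame({"col_id": [1, 2], "num": [3, 4], "name": ["linda", "James"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "no": [2, 6, 7], "name": ["granpa", "linda", "sam"]})

try:
    df1.compare(df2)
    error_message = None
except ValueError as exc:
    # compare refuses frames whose row/column labels are not identical
    error_message = str(exc)

print(error_message)
```

This is why the frames need to be aligned before compare can be used.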
Align via merge
Outer-merge the two dfs:
merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')
# col_id num name_x id no name_y
# 0 1 3 linda 1 2 granpa
# 1 2 4 James 2 6 linda
# 2 NaN NaN NaN 3 7 sam
Split the merged frame into left/right frames and align their columns with set_axis:
cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
# col_id num name
# 0 1 3 linda
# 1 2 4 James
# 2 NaN NaN NaN
right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)
# col_id num name
# 0 1 2 granpa
# 1 2 6 linda
# 2 3 7 sam
compare the aligned left/right frames (use keep_equal=True to show equal cells):
left.compare(right, keep_shape=True, keep_equal=True)
# col_id num name
# self other self other self other
# 0 1 1 3 2 linda granpa
# 1 2 2 4 6 James linda
# 2 NaN 3 NaN 7 NaN sam
left.compare(right, keep_shape=True)
# col_id num name
# self other self other self other
# 0 NaN NaN 3 2 linda granpa
# 1 NaN NaN 4 6 James linda
# 2 NaN 3 NaN 7 NaN sam
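The merge-based steps above can be put together as one runnable sketch. It rebuilds the frames from the question and leaves the dtypes alone, so the numeric columns of the left frame come out as floats once NaN appears. The variable name diff is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"col_id": [1, 2], "num": [3, 4], "name": ["linda", "James"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "no": [2, 6, 7], "name": ["granpa", "linda", "sam"]})

# Outer-merge on the differently named key columns; both key columns are kept.
merged = df1.merge(df2, how="outer", left_on="col_id", right_on="id")

# Split the merged frame back into its df1 half and its df2 half,
# giving both halves df1's column names so they are identically labeled.
cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)

# keep_shape=True keeps every row and column; equal cells are shown as NaN.
diff = left.compare(right, keep_shape=True)
print(diff)
```

Passing keep_equal=True as well shows the unchanged values instead of NaN, as in the first output above.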
Align via reindex
If you are 100% sure that one df is a subset of the other, then reindex the subset's rows.
In your example, df1 is a subset of df2, so reindex df1:
(
    df1.assign(id=df1.col_id)            # copy col_id (we need original col_id after reindexing)
    .set_index('id')                     # set index to copied id
    .reindex(df2.id)                     # reindex against df2's id
    .reset_index(drop=True)              # remove copied id
    .set_axis(df2.columns, axis=1)       # align column names
    .compare(df2, keep_equal=True, keep_shape=True)
)
# id no name
# self other self other self other
# 0 1 1 3 2 linda granpa
# 1 2 2 4 6 James linda
# 2 NaN 3 NaN 7 NaN sam
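As a self-contained sketch of the reindex route, the chain can be run as below. It assumes, as stated above, that every col_id in df1 also appears in df2's id column. The names aligned and diff are only illustrative, and no dtype conversion is applied, so the numeric columns of the reindexed frame become floats.

```python
import pandas as pd

df1 = pd.DataFrame({"col_id": [1, 2], "num": [3, 4], "name": ["linda", "James"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "no": [2, 6, 7], "name": ["granpa", "linda", "sam"]})

aligned = (
    df1.assign(id=df1["col_id"])        # copy col_id so it survives as a column
    .set_index("id")                    # index by the copied id
    .reindex(df2["id"])                 # add rows for ids that only exist in df2
    .reset_index(drop=True)             # back to a 0..n-1 index, matching df2
    .set_axis(df2.columns, axis=1)      # use df2's column names
)

# Both frames are now identically labeled, so compare works.
diff = aligned.compare(df2, keep_equal=True, keep_shape=True)
print(diff)
```

Because the column names are taken from df2 here, the top level of the result is labeled id, no, name.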
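Nullable integers
Normally int cannot be mixed with nan, so pandas converts the column to float. To keep the int values as int (as in the examples above):
- Ideally, we would convert the int columns to nullable integers with astype('Int64') (capital I).
- However, there is currently a comparison bug with Int64, so use astype(object) for now.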
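The dtype behaviour described here can be checked on a small Series. This is only a sketch: the variable names are illustrative, and it demonstrates the upcasting and the two conversions, not the Int64 comparison bug itself, whose status depends on the pandas version in use.

```python
import pandas as pd

ints = pd.Series([1, 2])

# Reindexing to a label that does not exist introduces a missing value.
upcast = ints.reindex([0, 1, 2])                      # plain int64 is upcast to float64
as_object = ints.astype(object).reindex([0, 1, 2])    # object dtype keeps the ints, gap is NaN
nullable = ints.astype("Int64").reindex([0, 1, 2])    # nullable Int64 keeps the ints, gap is <NA>

print(upcast.dtype, as_object.dtype, nullable.dtype)
```

In the examples above, the same conversion would be applied to df1 and df2 before merging or reindexing.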