Pandas Concat dataframes with Duplicates

Question

I am having trouble concatenating two dataframes of different lengths. Here is the problem:

df1 =
emp_id emp_name counts
1      sam       0
2      joe       0
3      john      0

df2 =
emp_id emp_name counts
1      sam       0
2      joe       0
2      joe       1
3      john      0

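My expected output is shown below. Note that I do not want to merge the two dataframes into one; I want to concatenate them side by side and highlight the differences. If one dataframe contains a duplicated row (df2, for example), the corresponding row of df1 should be shown as NaN/blank/None, or any other kind of null value.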

Expected_output_df = 
df1                      df2    
emp_id  emp_name counts  emp_id   emp_name  counts
1       sam       0      1        sam       0
2       joe       0      2        joe       0
NaN     NaN       NaN    2        joe       1
3       john      0      3        john      0

However, this is what I get:

actual_output_df = pd.concat([df1, df2], axis='columns', keys=['df1','df2'])

The code above gives me the dataframe below, but how can I get the dataframe shown in the expected output?

actual_output_df = 
df1                      df2    
emp_id  emp_name counts  emp_id   emp_name  counts
1       sam       0      1        sam       0
2       joe       0      2        joe       0
3       john      0      2        joe       1
NaN     NaN       NaN    3        john      0

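I have tried passing different parameters to pd.concat but did not get the expected result. My main problem with concat is that I cannot shift the duplicated row down by one row.

Can anyone help me with this? Thanks in advance.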

Answer 1

This does not give exactly the output you asked for, but it may solve your problem anyway:

df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator=True)

Output:

    emp_id  emp_name    counts  _merge
0   1       sam         0       both
1   2       joe         0       both
2   3       john        0       both
3   2       joe         1       right_only

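You do not get the NaN rows you wanted, but the _merge column tells you whether a row is in the left df, the right df, or both. You can also give that column a custom name by passing a string, e.g. indicator='name'.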
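As a small illustration of the indicator column, here is a sketch that rebuilds the two frames from the question and passes a custom column name to indicator. The name source and the variable only_in_df2 are chosen here just for the example; they are not from the question.

```python
import pandas as pd

# Frames rebuilt from the tables in the question.
df1 = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "emp_name": ["sam", "joe", "john"],
    "counts": [0, 0, 0],
})
df2 = pd.DataFrame({
    "emp_id": [1, 2, 2, 3],
    "emp_name": ["sam", "joe", "joe", "john"],
    "counts": [0, 0, 1, 0],
})

keys = ["emp_id", "emp_name", "counts"]

# Passing a string to indicator names the indicator column
# ("source" is an illustrative name, not a pandas default).
merged = df1.merge(df2, on=keys, how="outer", indicator="source")

# Keep only the rows that exist in df2 but not in df1.
only_in_df2 = merged[merged["source"] == "right_only"]
print(only_in_df2)
```

Filtering on "left_only" instead would list the rows that exist only in df1; with the sample data there are none.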

Update

To get the side-by-side layout you want, you can do the following:

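import numpy as np
import pandas as pd

# Outer merge on all columns; _merge records which frame each row came from.
output_df = df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator=True)

# Duplicate the merged columns so there is one copy per source frame.
output_df[['emp_id2', 'emp_name2', 'counts2']] = output_df[['emp_id', 'emp_name', 'counts']]

# Blank out the side on which the row does not exist.
output_df.loc[output_df._merge == 'right_only', ['emp_id', 'emp_name', 'counts']] = np.nan
output_df.loc[output_df._merge == 'left_only', ['emp_id2', 'emp_name2', 'counts2']] = np.nan
output_df = output_df.drop('_merge', axis=1)

# Label the two halves with a two-level column index.
output_df.columns = pd.MultiIndex.from_tuples([('df1', 'emp_id'), ('df1', 'emp_name'), ('df1', 'counts'),
                     ('df2', 'emp_id'), ('df2', 'emp_name'), ('df2', 'counts')])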

Output:

    df1                         df2
    emp_id  emp_name    counts  emp_id  emp_name    counts
0   1.0     sam         0.0     1.0     sam         0.0
1   2.0     joe         0.0     2.0     joe         0.0
2   3.0     john        0.0     3.0     john        0.0
3   NaN     NaN         NaN     2.0     joe         1.0
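The table above puts the df2-only row last, whereas the question wanted it directly below its match. A minimal sketch of one way to get that order is to sort the merged frame on the key columns before splitting it into two halves. It assumes that ordering by emp_id, emp_name, counts is acceptable (it does not preserve the original row order in general) and that no fully identical row is repeated within a frame, since an outer merge on all columns would multiply such rows. The names in_df1, in_df2, left, right and side_by_side are illustrative only.

```python
import pandas as pd

# Frames rebuilt from the tables in the question.
df1 = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "emp_name": ["sam", "joe", "john"],
    "counts": [0, 0, 0],
})
df2 = pd.DataFrame({
    "emp_id": [1, 2, 2, 3],
    "emp_name": ["sam", "joe", "joe", "john"],
    "counts": [0, 0, 1, 0],
})

keys = ["emp_id", "emp_name", "counts"]

merged = df1.merge(df2, on=keys, how="outer", indicator=True)

# Sort on the key columns so a df2-only row sits next to its match,
# then reset the index so both halves share the same row labels.
merged = merged.sort_values(keys).reset_index(drop=True)

# Boolean masks: which rows are present in each source frame.
in_df1 = merged["_merge"] != "right_only"
in_df2 = merged["_merge"] != "left_only"

# Select the rows each frame has, then reindex to the full index;
# rows missing from that frame come back filled with NaN.
left = merged.loc[in_df1, keys].reindex(merged.index)
right = merged.loc[in_df2, keys].reindex(merged.index)

# Put the halves side by side under the top-level labels df1 and df2.
side_by_side = pd.concat([left, right], axis="columns", keys=["df1", "df2"])
print(side_by_side)
```

Because the df1 half now contains NaN, its numeric columns are upcast to float, just as in the output above.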
