pandas: append rows to another dataframe under the similar row based on multiple columns

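I asked a very similar question here, but I would like to know whether there is a way to handle the case where the append has to depend on multiple columns. The dataframes look like this: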

import pandas as pd

d1 = {
    'col1': ['I ate dinner', 'I ate dinner', 'the play was inetresting', 'the play was inetresting'],
    'col2': ['I ate dinner', 'I went to school', 'the play was inetresting for her', 'the gold is shining'],
    'col3': ['I went out', 'I did not stay at home', 'the play was inetresting for her', 'the house is nice'],
    'col4': ['min', 'max', 'mid', 'min'],
    'col5': ['min', 'max', 'max', 'max'],
}

d2 = {
    'col1': ['I ate dinner', ' the glass is shattered', 'the play was inetresting'],
    'col2': ['I ate dinner', ' the weather is nice', 'the gold is shining'],
    'col3': ['I went out', ' the house was amazing', 'the house is nice'],
    'col4': ['min', 'max', 'max'],
    'col5': ['max', 'min', 'mid'],
}

df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

This time, I want to append the rows of df2 under the matching rows of df1, but only when col1, col2, and col3 all match. So the output should be:

                       col1                 col2               col3 col4 col5
0              I ate dinner         I ate dinner         I went out  min  min
1              I ate dinner         I ate dinner         I went out  min  max
2  the play was inetresting  the gold is shining  the house is nice  min  max
3  the play was inetresting  the gold is shining  the house is nice  max  mid

So I tried the following:

df = pd.concat(df1[df1.set_index(['col1','col2','col3']).index.isin(df2.set_index(['col1','col2','col3']).index)]).sort_values(df1.set_index(['col1','col2','col3']).index, ignore_index=True)

But I get this error:

 TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
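
For reference, the message points at the first argument of pd.concat: it expects a list (or another iterable) of pandas objects, not a single DataFrame. A tiny sketch of the difference, using two made-up frames `a` and `b` that are not part of the question:

```python
import pandas as pd

# Hypothetical toy frames, only to illustrate the call signature
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3]})

# Passing a bare DataFrame is rejected with a TypeError
try:
    pd.concat(a)
    raised = False
except TypeError:
    raised = True

# Wrapping the frames in a list is accepted
combined = pd.concat([a, b], ignore_index=True)
```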

OK, I realized my mistake and will post the answer here in case anyone is interested (the answer is based on the link in the question):

print(pd.concat([df1, df2[df2.set_index(['col1','col2','col3']).index.isin(df1.set_index(['col1','col2','col3']).index)]]).sort_values(['col1','col2','col3'], ignore_index=True))
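
Note that the one-liner above keeps every row of df1, including rows that have no counterpart in df2. If only the matched rows from both frames should survive, as in the expected output of the question, a possible variant is to filter both sides before concatenating. This is a sketch under that assumption; the names `keys`, `idx1`, `idx2`, and the helper column `_src` are introduced here for illustration only.

```python
import pandas as pd

d1 = {
    'col1': ['I ate dinner', 'I ate dinner', 'the play was inetresting', 'the play was inetresting'],
    'col2': ['I ate dinner', 'I went to school', 'the play was inetresting for her', 'the gold is shining'],
    'col3': ['I went out', 'I did not stay at home', 'the play was inetresting for her', 'the house is nice'],
    'col4': ['min', 'max', 'mid', 'min'],
    'col5': ['min', 'max', 'max', 'max'],
}
d2 = {
    'col1': ['I ate dinner', ' the glass is shattered', 'the play was inetresting'],
    'col2': ['I ate dinner', ' the weather is nice', 'the gold is shining'],
    'col3': ['I went out', ' the house was amazing', 'the house is nice'],
    'col4': ['min', 'max', 'max'],
    'col5': ['max', 'min', 'mid'],
}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

keys = ['col1', 'col2', 'col3']
idx1 = pd.MultiIndex.from_frame(df1[keys])
idx2 = pd.MultiIndex.from_frame(df2[keys])

# Keep only rows whose key triple appears in the other frame;
# _src marks the origin so df1 rows sort ahead of df2 rows
matched = pd.concat([
    df1[idx1.isin(idx2)].assign(_src=0),
    df2[idx2.isin(idx1)].assign(_src=1),
])

out = (
    matched.sort_values(keys + ['_src'], ignore_index=True)
           .drop(columns='_src')
)
print(out)
```

Sorting on the explicit `_src` column avoids relying on the sort being stable to keep each df1 row above its df2 match.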

Another solution is to use pd.merge and pd.wide_to_long:

out = (
    pd.wide_to_long(
        pd.merge(df1, df2, how='inner', on=['col1', 'col2', 'col3']).reset_index(),
        stubnames=['col4', 'col5'], i='index', j='val', sep='_', suffix=r'[xy]')
      .sort_index().reset_index(drop=True)[df1.columns]
)

Output:

>>> out
                       col1                 col2               col3 col4 col5
0              I ate dinner         I ate dinner         I went out  min  min
1              I ate dinner         I ate dinner         I went out  min  max
2  the play was inetresting  the gold is shining  the house is nice  min  max
3  the play was inetresting  the gold is shining  the house is nice  max  mid

Step by step

# Step 1: merge
>>> out = pd.merge(df1, df2, how='inner', on=['col1', 'col2', 'col3']).reset_index()
   index                      col1                 col2               col3 col4_x col5_x col4_y col5_y
0      0              I ate dinner         I ate dinner         I went out    min    min    min    max
1      1  the play was inetresting  the gold is shining  the house is nice    min    max    max    mid

# Step 2: wide_to_long
>>> out = pd.wide_to_long(out, stubnames=['col4', 'col5'], i='index', j='val', sep='_', suffix=r'[xy]')
                        col3                 col2                      col1 col4 col5
index val                                                                            
0     x           I went out         I ate dinner              I ate dinner  min  min
1     x    the house is nice  the gold is shining  the play was inetresting  min  max
0     y           I went out         I ate dinner              I ate dinner  min  max
1     y    the house is nice  the gold is shining  the play was inetresting  max  mid

# Step 3: reorder dataframe
>>> out = out.sort_index().reset_index(drop=True)[df1.columns]
                       col1                 col2               col3 col4 col5
0              I ate dinner         I ate dinner         I went out  min  min
1              I ate dinner         I ate dinner         I went out  min  max
2  the play was inetresting  the gold is shining  the house is nice  min  max
3  the play was inetresting  the gold is shining  the house is nice  max  mid

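I strongly suggest coming up with a numeric minimal example rather than a text-based one. It is easier to read and easier to understand. That said, if I understand correctly, for each row of df1 you want to: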

  1. Check whether df2 contains rows with the same values in certain columns.
  2. Append those rows to df1, right after the row in question.
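
To make the two steps concrete, here is what a numeric toy example could look like. It is only a sketch: the frames `left` and `right`, the columns `a`, `b`, `v`, and the helper columns `pos` and `src` are all made up for illustration and do not come from the question.

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 1, 2], 'b': [10, 20, 30], 'v': [0.1, 0.2, 0.3]})
right = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 30, 40], 'v': [9.1, 9.3, 9.4]})
key_cols = ['a', 'b']

# Step 1: find the rows of `right` whose key appears in `left`,
# remembering the position of the left row they belong under
left_pos = left[key_cols].assign(pos=range(len(left)))
matches = right.merge(left_pos, on=key_cols, how='inner')

# Step 2: stack both frames and sort by (position in left, source),
# so each matching right row lands just after its left row
stacked = pd.concat([
    left.assign(pos=range(len(left)), src=0),
    matches.assign(src=1),
])
result = (
    stacked.sort_values(['pos', 'src'], ignore_index=True)
           .drop(columns=['pos', 'src'])
)
print(result)
```

In this sketch the unmatched rows of `left` are kept, and a key that occurs several times in `left` gets the matching `right` rows repeated after each occurrence.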

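Of course, we could discuss the case of duplicates in df1 and how you want to handle them. With that settled, we can write two solutions, one using a for loop and one using functional programming in pandas (depending on your skills, habits, and other preferences).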

A for-loop approach

Assuming there are no duplicates in df1:

import pandas as pd

df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

cols = ["col1", "col2", "col3"]

## collect the pieces in a list (DataFrame.append was removed in pandas 2.0)
pieces = []

## loop over all rows of df1
for _, row1 in df1.iterrows():

    ## a row of df2 is selected only if all of its key columns equal those of row1
    is_eq_to_row_1 = df2[cols].eq(row1[cols]).all(axis=1)

    ## if at least one row of df2 is equal to row1, keep them
    if is_eq_to_row_1.any():
        ## first row1, turned back into a one-row dataframe
        pieces.append(row1.to_frame().T)
        ## then all rows of df2 equal to row1
        pieces.append(df2[is_eq_to_row_1])

## build the result once at the end; fall back to an empty frame if nothing matched
new_df = pd.concat(pieces, ignore_index=True) if pieces else pd.DataFrame(columns=df1.columns)

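A functional approach

I have not had time to write a proper solution yet, but I guess it involves groupby, apply, and a bunch of related pandas functions. For each row x, we still use df2[df2.eq(x.loc[cols]).loc[:, cols].all(axis=1)] to select the rows of df2 equal to x.

We just need to "loop" over all rows. A tool designed for that is groupby. With it, we no longer have to care about duplicates.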

new_df = df1.groupby(cols).apply(
    lambda x: pd.concat([
        x,
        ## rows of df2 whose key columns equal those of the group's first row
        df2[df2.eq(x.iloc[0, :].loc[cols]).loc[:, cols].all(axis=1)],
    ])
)

There is still some work left: skip the append when no matching row of df2 is found, and clean up the output.
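
One possible way to finish that remaining work is sketched below. It iterates over the groups explicitly instead of calling apply, which is a swapped-in technique rather than the approach above, chosen because it makes skipping unmatched groups and resetting the index straightforward. The names `pieces`, `first_key`, and `matching` are introduced here for illustration.

```python
import pandas as pd

d1 = {
    'col1': ['I ate dinner', 'I ate dinner', 'the play was inetresting', 'the play was inetresting'],
    'col2': ['I ate dinner', 'I went to school', 'the play was inetresting for her', 'the gold is shining'],
    'col3': ['I went out', 'I did not stay at home', 'the play was inetresting for her', 'the house is nice'],
    'col4': ['min', 'max', 'mid', 'min'],
    'col5': ['min', 'max', 'max', 'max'],
}
d2 = {
    'col1': ['I ate dinner', ' the glass is shattered', 'the play was inetresting'],
    'col2': ['I ate dinner', ' the weather is nice', 'the gold is shining'],
    'col3': ['I went out', ' the house was amazing', 'the house is nice'],
    'col4': ['min', 'max', 'max'],
    'col5': ['max', 'min', 'mid'],
}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
cols = ['col1', 'col2', 'col3']

pieces = []
# sort=False keeps the groups in order of first appearance in df1
for _, group in df1.groupby(cols, sort=False):
    # every row of a group shares the same key, so the first row is enough
    first_key = group.iloc[0][cols]
    matching = df2[df2[cols].eq(first_key).all(axis=1)]
    # only keep groups that have at least one counterpart in df2
    if not matching.empty:
        pieces.extend([group, matching])

# a flat 0..n-1 index replaces the nested group index that apply would produce
new_df = pd.concat(pieces, ignore_index=True) if pieces else df1.iloc[0:0]
print(new_df)
```

Because duplicated keys in df1 fall into the same group, the matching df2 rows are appended once per group rather than once per duplicate row.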