pandas: append rows to another DataFrame under the similar rows based on multiple columns
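I asked a very similar question here, but I would like to know whether there is a way to do this when the append has to depend on multiple columns.
The dataframes look like this: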
import pandas as pd
d1 ={'col1': ['I ate dinner','I ate dinner', 'the play was inetresting','the play was inetresting'],
'col2': ['I ate dinner','I went to school', 'the play was inetresting for her','the gold is shining'],
'col3': ['I went out','I did not stay at home', 'the play was inetresting for her','the house is nice'],
'col4': ['min', 'max', 'mid','min'],
'col5': ['min', 'max', 'max','max']}
d2 ={'col1': ['I ate dinner',' the glass is shattered', 'the play was inetresting'],
'col2': ['I ate dinner',' the weather is nice', 'the gold is shining'],
'col3': ['I went out',' the house was amazing', 'the house is nice'],
'col4': ['min', 'max', 'max'],
'col5': ['max', 'min', 'mid']}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
This time, I want to append the rows of df2 under the similar rows of df1, but only when the rows match in all of col1, col2, and col3. The expected output is:
col1 col2 col3 col4 col5
0 I ate dinner I ate dinner I went out min min
1 I ate dinner I ate dinner I went out min max
2 the play was inetresting the gold is shining the house is nice min max
3 the play was inetresting the gold is shining the house is nice max mid
So I tried the following:
df = pd.concat(df1[df1.set_index(['col1','col2','col3']).index.isin(df2.set_index(['col1','col2','col3']).index)]).sort_values(df1.set_index(['col1','col2','col3']).index, ignore_index=True)
But I get this error:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
OK, I realized my mistake and will post the answer here in case anyone is interested (the answer is based on the link in the question):
print(pd.concat([df1, df2[df2.set_index(['col1','col2','col3']).index.isin(df1.set_index(['col1','col2','col3']).index)]]).sort_values(['col1','col2','col3'], ignore_index=True))
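A note on the line above: pd.concat takes a list (or other iterable) of frames, which is what the corrected call passes and what the first attempt was missing. The corrected call also keeps every row of df1, including rows with no counterpart in df2, whereas the expected output shows only the matched rows. Below is a minimal sketch of a variant that filters both frames, assuming that is the intent. The names keys, m1, m2 and the helper column _src are mine, not from the question.

```python
import pandas as pd

d1 = {'col1': ['I ate dinner', 'I ate dinner', 'the play was inetresting', 'the play was inetresting'],
      'col2': ['I ate dinner', 'I went to school', 'the play was inetresting for her', 'the gold is shining'],
      'col3': ['I went out', 'I did not stay at home', 'the play was inetresting for her', 'the house is nice'],
      'col4': ['min', 'max', 'mid', 'min'],
      'col5': ['min', 'max', 'max', 'max']}
d2 = {'col1': ['I ate dinner', ' the glass is shattered', 'the play was inetresting'],
      'col2': ['I ate dinner', ' the weather is nice', 'the gold is shining'],
      'col3': ['I went out', ' the house was amazing', 'the house is nice'],
      'col4': ['min', 'max', 'max'],
      'col5': ['max', 'min', 'mid']}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

keys = ['col1', 'col2', 'col3']
idx1 = df1.set_index(keys).index
idx2 = df2.set_index(keys).index

# Keep only the rows whose key triple also occurs in the other frame.
# _src is a hypothetical helper column: 0 marks df1 rows, 1 marks df2 rows.
m1 = df1[idx1.isin(idx2)].assign(_src=0)
m2 = df2[idx2.isin(idx1)].assign(_src=1)

# Sort on the keys, then on _src, so each df1 row comes before its df2 match.
out = (
    pd.concat([m1, m2])
    .sort_values(keys + ['_src'], ignore_index=True)
    .drop(columns='_src')
)
print(out)
```

Sorting on _src as well avoids relying on the sort being stable. Note that the result is ordered by the key values, not by the original row order of df1.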
Another solution is to use pd.merge and pd.wide_to_long:
out = (
    pd.wide_to_long(
        pd.merge(df1, df2, how='inner', on=['col1', 'col2', 'col3']).reset_index(),
        stubnames=['col4', 'col5'], i='index', j='val', sep='_', suffix=r'[xy]')
    .sort_index().reset_index(drop=True)[df1.columns]
)
Output:
>>> out
col1 col2 col3 col4 col5
0 I ate dinner I ate dinner I went out min min
1 I ate dinner I ate dinner I went out min max
2 the play was inetresting the gold is shining the house is nice min max
3 the play was inetresting the gold is shining the house is nice max mid
Step by step
# Step 1: merge
>>> out = pd.merge(df1, df2, how='inner', on=['col1', 'col2', 'col3']).reset_index()
index col1 col2 col3 col4_x col5_x col4_y col5_y
0 0 I ate dinner I ate dinner I went out min min min max
1 1 the play was inetresting the gold is shining the house is nice min max max mid
# Step 2: wide_to_long
>>> out = pd.wide_to_long(out, stubnames=['col4', 'col5'], i='index', j='val', sep='_', suffix=r'[xy]')
col3 col2 col1 col4 col5
index val
0 x I went out I ate dinner I ate dinner min min
1 x the house is nice the gold is shining the play was inetresting min max
0 y I went out I ate dinner I ate dinner min max
1 y the house is nice the gold is shining the play was inetresting max mid
# Step 3: reorder dataframe
>>> out = out.sort_index().reset_index(drop=True)[df1.columns]
col1 col2 col3 col4 col5
0 I ate dinner I ate dinner I went out min min
1 I ate dinner I ate dinner I went out min max
2 the play was inetresting the gold is shining the house is nice min max
3 the play was inetresting the gold is shining the house is nice max mid
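The suffix regex r'[xy]' in Step 2 works because '_x' and '_y' are the default suffixes that merge gives to overlapping non-key columns. A small sketch that spells the suffixes out, which may make that dependency easier to see. The toy frames a and b are made up for illustration and are not the question's data.

```python
import pandas as pd

# Toy frames, made up for illustration.
a = pd.DataFrame({'col1': ['p', 'q'], 'col2': ['r', 's'], 'col4': ['min', 'max']})
b = pd.DataFrame({'col1': ['p', 'z'], 'col2': ['r', 's'], 'col4': ['mid', 'min']})

# '_x' and '_y' are merge's defaults; passing them explicitly documents
# the column naming that the wide_to_long suffix pattern depends on.
merged = pd.merge(a, b, how='inner', on=['col1', 'col2'], suffixes=('_x', '_y'))
print(merged)
```

If different suffixes are passed to merge, the suffix pattern given to wide_to_long has to be changed to match.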
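I strongly suggest coming up with a minimal numeric example rather than a text-based one. It is easier to read and easier to understand. That said, if I understand correctly, for each row of df1 you want to:
- check whether some rows of df2 have the same values in certain columns;
- append those rows to df1, right after said row.
Of course, we could discuss the case of duplicates in df1 and how you want to handle them. We can then write two solutions, one using a for loop and the other using functional programming in pandas (depending on your skills, habits, and other preferences).
A for-loop approach
Assuming there are no duplicates in df1: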
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
cols = ["col1", "col2", "col3"]
## collect the pieces of the new dataframe
pieces = []
## loop over all rows of df1
for _, row1 in df1.iterrows():
    ## a row of df2 matches only if all the columns in cols are equal
    is_eq_to_row_1 = df2.eq(row1.loc[cols]).loc[:, cols].all(axis=1)
    ## if at least one row of df2 is equal to row1, keep both
    if is_eq_to_row_1.any():
        ## first row1 itself, as a one-row dataframe
        pieces.append(row1.to_frame().T)
        ## then all rows of df2 equal to row1
        pieces.append(df2[is_eq_to_row_1])
## DataFrame.append was removed in pandas 2.0, so concatenate once at the end
## (pd.concat raises ValueError if no row matched and pieces is empty)
new_df = pd.concat(pieces, ignore_index=True)
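A functional approach
I have not had time to write a proper solution yet, but I guess it involves groupby, apply, and a handful of other pandas functions. For each row x, we still use df2[df2.eq(x.loc[cols]).loc[:, cols].all(axis=1)] to select the rows of df2 that are equal to x.
We simply "loop" over all rows. A tool designed for this could be groupby. Then we no longer need to care about duplicates.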
new_df = (
    df1.groupby(cols)
    .apply(lambda x: pd.concat([
        x,
        df2[df2.eq(x.iloc[0, :].loc[cols]).loc[:, cols].all(axis=1)],
    ]))
)
There is still some work to do: do not append anything when no matching row of df2 is found, and clean up the output.
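One way to cover the remaining work is sketched below. It swaps groupby/apply for a merge, so it is a different technique from the snippet above, not a completion of it. The function name append_matches_below and the helper columns _pos, _src and _pos2 are hypothetical, and the sketch assumes neither frame already has columns with those names. It uses a small numeric example with a duplicated key in df1.

```python
import pandas as pd

def append_matches_below(df1, df2, keys):
    """Return matched rows of df1, each followed by the df2 rows sharing its keys."""
    # _pos remembers the original position of each df1 row, _src marks the source.
    left = df1.assign(_pos=range(len(df1)), _src=0, _pos2=0)
    # Attach to every df2 row the position of each df1 row it matches.
    right = (
        df2.assign(_pos2=range(len(df2)))
        .merge(left[keys + ['_pos']], on=keys, how='inner')
        .assign(_src=1)
    )
    # Drop df1 rows that have no match in df2.
    left = left[left['_pos'].isin(right['_pos'])]
    # Order: df1 position, then df1 row before df2 rows, then df2 order.
    out = pd.concat([left, right]).sort_values(
        ['_pos', '_src', '_pos2'], ignore_index=True
    )
    return out[list(df1.columns)]

# Numeric toy data, made up for illustration; the key (1, 10) is duplicated in df1.
df1 = pd.DataFrame({'a': [1, 1, 2, 3], 'b': [10, 10, 20, 30], 'v': ['x1', 'x2', 'x3', 'x4']})
df2 = pd.DataFrame({'a': [1, 3, 9], 'b': [10, 30, 90], 'v': ['y1', 'y2', 'y3']})

new_df = append_matches_below(df1, df2, ['a', 'b'])
print(new_df)
```

With duplicated keys in df1, the matching df2 rows are repeated after each duplicate. Whether that is the desired behaviour depends on how duplicates should be handled, as discussed above.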