Merging two dataframes without repeating values

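I have two dataframes, each containing a unique ID, a review column (positive in one, negative in the other), and a rating column (again, positive in one and negative in the other):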

df1:

unique_id pos_review pos_rating
1 "Great, would recommend... 8
1 "Really cool, you should go... 7
2 "I had a great time, you.. 9
3 "Good way to spend your night... 8
4 "I might go again for how good it was... 9

df2:

unique_id neg_review neg_rating
1 "Really boring... 4
2 "I'll never try this again... 2
2 "I would not recommend.. 3
3 "Could have been better... 4
3 "No one should ever go... 1

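I am trying to combine the two so that the unique IDs line up. If one dataframe has more reviews for an ID than the other, the missing reviews should become NaN values, which I will later replace with "No review". So ideally I would end up with: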

df3:

unique_id pos_review pos_rating neg_review neg_rating
1 "Great, would recommend... 8 "Really boring... 4
1 "Really cool, you should go... 7 NaN NaN
2 "I had a great time, you.. 9 "I'll never try this again... 2
2 NaN NaN "I would not recommend.. 3
3 "Good way to spend your night... 8 "Could have been better... 4
3 NaN NaN "No one should ever go... 1
4 "I might go again for how good it was... 9 NaN NaN

I have tried df3 = df1.merge(df2, on='unique_id', how='inner'), but that just repeats the first review from df1 for every review in df2, like this (look at unique_id 2 below):

unique_id pos_review pos_rating neg_review neg_rating
1 "Great, would recommend... 8 "Really boring... 4
1 "Really cool, you should go... 7 NaN NaN
2 "I had a great time, you.. 9 "I'll never try this again... 2
2 "I had a great time, you.. 9 "I would not recommend.. 3
3 "Good way to spend your night... 8 "Could have been better... 4
3 NaN NaN "No one should ever go... 1
4 "I might go again for how good it was... 9 NaN NaN

Any ideas on how to get the df3 shown above?

Change inner to outer, and use cumcount to create a sub-key:

df1['key'] = df1.groupby('unique_id').cumcount()
df2['key'] = df2.groupby('unique_id').cumcount()
df3 = df1.merge(df2,on = ['unique_id','key'],how='outer').sort_values('unique_id')

Out[134]:

   unique_id             pos_review  pos_rating  key neg_review  neg_rating
0          1  reatwouldrecommend...         8.0    0    Really         4.0
1          1                 Really         7.0    1       NaN         NaN
2          2                    had         9.0    0      I'll         2.0
5          2                    NaN         NaN    1         I         3.0
3          3                   Good         8.0    0     Could         4.0
6          3                    NaN         NaN    1        No         1.0
4          4                      I         9.0    0       NaN         NaN

# you can also drop the key column with df3 = df3.drop(['key'],axis=1)
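
For reference, here is a self-contained sketch of the same approach from start to finish. The review texts are shortened stand-ins for the ones in the question, and the final fillna step is an assumption about how the "No review" replacement mentioned in the question might be done.

```python
import pandas as pd

# Sample data abbreviated from the question (review texts shortened).
df1 = pd.DataFrame({
    "unique_id": [1, 1, 2, 3, 4],
    "pos_review": ["Great", "Really cool", "I had a great time",
                   "Good way", "I might go again"],
    "pos_rating": [8, 7, 9, 8, 9],
})
df2 = pd.DataFrame({
    "unique_id": [1, 2, 2, 3, 3],
    "neg_review": ["Really boring", "I'll never try this again",
                   "I would not recommend", "Could have been better",
                   "No one should ever go"],
    "neg_rating": [4, 2, 3, 4, 1],
})

# Number the rows within each unique_id: 0, 1, 2, ...
df1["key"] = df1.groupby("unique_id").cumcount()
df2["key"] = df2.groupby("unique_id").cumcount()

# Pair the n-th positive review with the n-th negative review per ID.
df3 = (
    df1.merge(df2, on=["unique_id", "key"], how="outer")
       .sort_values(["unique_id", "key"])
       .drop(columns="key")
       .reset_index(drop=True)
)

# Replace missing review text only; the missing ratings stay NaN.
review_cols = ["pos_review", "neg_review"]
df3[review_cols] = df3[review_cols].fillna("No review")

print(df3)
```

Sorting on both unique_id and key keeps the paired rows in a predictable order before the helper column is dropped.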

You need a cumulative counter (cumcount on the groupby) as a second merge key:

df3 = pd.merge(
    df1,df2, 
    left_on=['unique_id',df1.groupby('unique_id').cumcount()],
    right_on=['unique_id',df2.groupby('unique_id').cumcount()],
    how='outer')

This gives the expected result.
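
To see why the counter works as a second key, it may help to look at what groupby(...).cumcount() returns on its own. The small sketch below uses only the unique_id values from the question: each row is numbered by its position within its ID group, so the first review for an ID gets 0, the second gets 1, and so on.

```python
import pandas as pd

# Only the ID columns from the question's df1 and df2.
ids_pos = pd.DataFrame({"unique_id": [1, 1, 2, 3, 4]})
ids_neg = pd.DataFrame({"unique_id": [1, 2, 2, 3, 3]})

# Position of each row within its unique_id group.
counter_pos = ids_pos.groupby("unique_id").cumcount()
counter_neg = ids_neg.groupby("unique_id").cumcount()

print(counter_pos.tolist())  # [0, 1, 0, 0, 0]
print(counter_neg.tolist())  # [0, 0, 1, 0, 1]
```

Merging on the pair (unique_id, counter) therefore matches each row with at most one row on the other side, which is what prevents the repeated values.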

Updated after @HenryEcker pointed out that append is being deprecated.

I would use pd.concat rather than DataFrame.merge, because 'unique_id' is not actually unique among the values in either table.

df3 = pd.concat([df1, df2], ignore_index=True)

Maybe merge is confusing your sense of what the output table should look like. I think your ideal df3 example needs to include extra rows with NaNs.

For example, for unique_id = 1 you should have three rows:

  • two with NaN in the negative columns
  • one with NaN in the positive columns

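I'm not sure why you would assign the negative review to only one of the rows with unique_id = 1 and not the other. It is better to keep all the rows and use NaN wherever appropriate.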
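
As a rough illustration of the three rows described for unique_id = 1, here is a minimal sketch with shortened stand-in review texts. The name stacked and the stable sort are choices made for this example only.

```python
import pandas as pd

# Sample data abbreviated from the question (review texts shortened).
df1 = pd.DataFrame({
    "unique_id": [1, 1, 2, 3, 4],
    "pos_review": ["Great", "Really cool", "I had a great time",
                   "Good way", "I might go again"],
    "pos_rating": [8, 7, 9, 8, 9],
})
df2 = pd.DataFrame({
    "unique_id": [1, 2, 2, 3, 3],
    "neg_review": ["Really boring", "I'll never try this again",
                   "I would not recommend", "Could have been better",
                   "No one should ever go"],
    "neg_rating": [4, 2, 3, 4, 1],
})

# Stack the rows; columns missing from one frame are filled with NaN.
stacked = pd.concat([df1, df2], ignore_index=True)
stacked = stacked.sort_values("unique_id", kind="stable")

# unique_id 1 has two positive rows and one negative row.
rows_for_1 = stacked[stacked["unique_id"] == 1]
print(rows_for_1)
```

Every review keeps its own row here, so the result has 10 rows rather than the 7 in the question's ideal df3.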

Then, if you want to aggregate, use DataFrame.groupby. For example, the mean rating:

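# numeric_only=True skips the text review columns, which recent pandas versions refuse to average
grouped_mean = df3.groupby('unique_id').mean(numeric_only=True)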

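Note that this gives you a new df containing the mean of the negative ratings and the mean of the positive ratings separately, because they are in different columns of df3.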
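
A small sketch of that aggregation, using only the rating columns from the question. The frame names pos and neg are hypothetical, and the rating columns are selected explicitly so that no text columns are involved.

```python
import pandas as pd

# Ratings only, taken from the question's df1 and df2.
pos = pd.DataFrame({"unique_id": [1, 1, 2, 3, 4],
                    "pos_rating": [8, 7, 9, 8, 9]})
neg = pd.DataFrame({"unique_id": [1, 2, 2, 3, 3],
                    "neg_rating": [4, 2, 3, 4, 1]})

# Stack the rows; each rating column is NaN where the other frame's rows sit.
df3 = pd.concat([pos, neg], ignore_index=True)

# mean() ignores NaN, so each column is averaged over its own reviews only.
grouped_mean = df3.groupby("unique_id")[["pos_rating", "neg_rating"]].mean()
print(grouped_mean)
```

An ID with no reviews of one kind, such as unique_id 4 on the negative side, ends up with NaN in that column.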