Merging two dataframes without repeating values

Question

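I have two dataframes, each with a unique ID column, a review column (positive in one, negative in the other), and a rating column (likewise positive in one, negative in the other):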

df1:

unique_id	pos_review	pos_rating
1	"Great, would recommend...	8
1	"Really cool, you should go...	7
2	"I had a great time, you..	9
3	"Good way to spend your night...	8
4	"I might go again for how good it was...	9

df2:

unique_id	neg_review	neg_rating
1	"Really boring...	4
2	"I'll never try this again...	2
2	"I would not recommend..	3
3	"Could have been better...	4
3	"No one should ever go...	1

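I am trying to combine the two so that the unique IDs line up. Where one side has more reviews than the other, the missing reviews should become NaN values, which I will later replace with "No review". So ideally I would end up with: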

df3:

unique_id	pos_review	pos_rating	neg_review	neg_rating
1	"Great, would recommend...	8	"Really boring...	4
1	"Really cool, you should go...	7	NaN	NaN
2	"I had a great time, you..	9	"I'll never try this again...	2
2	NaN	NaN	"I would not recommend..	3
3	"Good way to spend your night...	8	"Could have been better...	4
3	NaN	NaN	"No one should ever go...	1
4	"I might go again for how good it was...	9	NaN	NaN

I tried using df3 = df1.merge(df2, on='unique_id', how='inner'), but this just repeats the first review from df1 for every review in df2, like so (look at unique_id 2 below):

unique_id	pos_review	pos_rating	neg_review	neg_rating
1	"Great, would recommend...	8	"Really boring...	4
1	"Really cool, you should go...	7	NaN	NaN
2	"I had a great time, you..	9	"I'll never try this again...	2
2	"I had a great time, you..	9	"I would not recommend..	3
3	"Good way to spend your night...	8	"Could have been better...	4
3	NaN	NaN	"No one should ever go...	1
4	"I might go again for how good it was...	9	NaN	NaN

Any ideas on how to get the df3 shown above?

Answer 1

Change inner to outer, and use cumcount to create a sub-key:

df1['key'] = df1.groupby('unique_id').cumcount()
df2['key'] = df2.groupby('unique_id').cumcount()
df3 = df1.merge(df2,on = ['unique_id','key'],how='outer').sort_values('unique_id')

Out[134]:

   unique_id             pos_review  pos_rating  key neg_review  neg_rating
0          1  reatwouldrecommend...         8.0    0    Really         4.0
1          1                 Really         7.0    1       NaN         NaN
2          2                    had         9.0    0      I'll         2.0
5          2                    NaN         NaN    1         I         3.0
3          3                   Good         8.0    0     Could         4.0
6          3                    NaN         NaN    1        No         1.0
4          4                      I         9.0    0       NaN         NaN

# you can also drop the key column with df3 = df3.drop(['key'],axis=1)
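
To see why the sub-key helps, here is a minimal self-contained sketch. The frames are shortened stand-ins for the question's data (the abbreviated review strings are an assumption for illustration), and the names plain and keyed are not from the original answer.

```python
import pandas as pd

# Shortened stand-ins for the question's frames (review text abbreviated).
df1 = pd.DataFrame({
    "unique_id": [1, 1, 2, 3, 4],
    "pos_review": ["Great", "Really cool", "I had a great time",
                   "Good way", "I might go again"],
    "pos_rating": [8, 7, 9, 8, 9],
})
df2 = pd.DataFrame({
    "unique_id": [1, 2, 2, 3, 3],
    "neg_review": ["Really boring", "I'll never try", "I would not recommend",
                   "Could have been better", "No one should ever go"],
    "neg_rating": [4, 2, 3, 4, 1],
})

# cumcount numbers the rows within each unique_id group, starting at 0.
df1["key"] = df1.groupby("unique_id").cumcount()
df2["key"] = df2.groupby("unique_id").cumcount()
print(df1["key"].tolist())  # [0, 1, 0, 0, 0]
print(df2["key"].tolist())  # [0, 0, 1, 0, 1]

# Merging on unique_id alone pairs every left row with every right row
# that shares the id, so values get repeated.
plain = df1.drop(columns="key").merge(
    df2.drop(columns="key"), on="unique_id", how="outer"
)

# Merging on (unique_id, key) pairs only rows at the same position in the group.
keyed = (
    df1.merge(df2, on=["unique_id", "key"], how="outer")
       .sort_values(["unique_id", "key"])
)

print((plain["pos_review"] == "I had a great time").sum())  # 2
print((keyed["pos_review"] == "I had a great time").sum())  # 1
```

Both results have seven rows here. The difference is that the keyed merge leaves NaN where one side has no matching row, instead of repeating a review.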

Answer 2

You need a cumulative counter per group (groupby followed by cumcount) as a second merge key.

df3 = pd.merge(
    df1,df2, 
    left_on=['unique_id',df1.groupby('unique_id').cumcount()],
    right_on=['unique_id',df2.groupby('unique_id').cumcount()],
    how='outer')

This gives the expected result.
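
A possible variant, sketched under a few assumptions: it keeps the counter in a named temporary column so it is easy to drop afterwards, and it leaves df1 and df2 unmodified. The helper name merge_without_repeats and the column name _pos are hypothetical, and the sample data is abbreviated from the question. The last step fills the missing reviews with "No review", as the question intends to do later.

```python
import pandas as pd

def merge_without_repeats(left, right, on):
    """Outer-merge two frames, pairing the n-th row of each `on` group."""
    # "_pos" is a temporary column name assumed not to exist in either frame.
    # assign returns copies, so the callers' frames are left untouched.
    left = left.assign(_pos=left.groupby(on).cumcount())
    right = right.assign(_pos=right.groupby(on).cumcount())
    merged = left.merge(right, on=[on, "_pos"], how="outer")
    return (
        merged.sort_values([on, "_pos"])
              .drop(columns="_pos")
              .reset_index(drop=True)
    )

# Shortened stand-ins for the question's frames (review text abbreviated).
df1 = pd.DataFrame({
    "unique_id": [1, 1, 2, 3, 4],
    "pos_review": ["Great", "Really cool", "I had a great time",
                   "Good way", "I might go again"],
    "pos_rating": [8, 7, 9, 8, 9],
})
df2 = pd.DataFrame({
    "unique_id": [1, 2, 2, 3, 3],
    "neg_review": ["Really boring", "I'll never try", "I would not recommend",
                   "Could have been better", "No one should ever go"],
    "neg_rating": [4, 2, 3, 4, 1],
})

df3 = merge_without_repeats(df1, df2, "unique_id")

# Replace missing review text only; the rating columns keep their NaN values.
review_cols = ["pos_review", "neg_review"]
df3[review_cols] = df3[review_cols].fillna("No review")
print(df3)
```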

Answer 3

Updated after @HenryEcker pointed out that append will be deprecated.

I would use pd.concat instead of DataFrame.merge, because 'unique_id' is not actually unique in the sense of the table's values.

df3 = pd.concat([df1, df2], ignore_index=True)

Maybe merge is confusing your sense of what the output table should look like. I think your ideal df3 example needs to include extra rows with NaNs.

For example, for unique_id = 1 you should have three rows:

two with NaN in the negative columns
one with NaNs in the positive columns

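I'm not sure why you would assign the negative review to only one of the unique_id = 1 rows and not the others. It is better to keep all the rows and use NaN wherever appropriate.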

Then, if you want to aggregate, use DataFrame.groupby. For example, the mean rating:

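# select the numeric rating columns; the review columns are text and cannot be averaged
grouped_mean = df3.groupby('unique_id')[['pos_rating', 'neg_rating']].mean()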

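Note that this gives you a new df containing the mean of the negative ratings and the mean of the positive ratings, since they sit in different columns of df3.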
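
As a rough end-to-end illustration of the concat approach, here is a self-contained sketch. The sample frames are abbreviated from the question, and the names stacked and means are assumptions for this example only.

```python
import pandas as pd

# Shortened stand-ins for the question's frames (review text abbreviated).
df1 = pd.DataFrame({
    "unique_id": [1, 1, 2, 3, 4],
    "pos_review": ["Great", "Really cool", "I had a great time",
                   "Good way", "I might go again"],
    "pos_rating": [8, 7, 9, 8, 9],
})
df2 = pd.DataFrame({
    "unique_id": [1, 2, 2, 3, 3],
    "neg_review": ["Really boring", "I'll never try", "I would not recommend",
                   "Could have been better", "No one should ever go"],
    "neg_rating": [4, 2, 3, 4, 1],
})

# Stack the frames: every review keeps its own row, and the columns that
# belong to the other frame are filled with NaN.
stacked = pd.concat([df1, df2], ignore_index=True)
print(len(stacked))  # 10

# Average only the numeric rating columns; mean skips NaN by default.
means = stacked.groupby("unique_id")[["pos_rating", "neg_rating"]].mean()
print(means)
```

In this sketch unique_id 4 has no negative reviews, so its neg_rating mean is NaN.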
