Merging two dataframes without repeating values
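I have two dataframes, each with a unique ID column, a review column (positive in one, negative in the other), and a rating column (again, positive in one and negative in the other):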
df1:
unique_id | pos_review | pos_rating |
---|---|---|
1 | "Great, would recommend... | 8 |
1 | "Really cool, you should go... | 7 |
2 | "I had a great time, you.. | 9 |
3 | "Good way to spend your night... | 8 |
4 | "I might go again for how good it was... | 9 |
df2:
unique_id | neg_review | neg_rating |
---|---|---|
1 | "Really boring... | 4 |
2 | "I'll never try this again... | 2 |
2 | "I would not recommend.. | 3 |
3 | "Could have been better... | 4 |
3 | "No one should ever go... | 1 |
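I am trying to combine the two so that the unique IDs line up. If one dataframe has more reviews for an ID than the other, the missing reviews should become NaN values, which I will later replace with "No review". Ideally I would get: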
df3:
unique_id | pos_review | pos_rating | neg_review | neg_rating |
---|---|---|---|---|
1 | "Great, would recommend... | 8 | "Really boring... | 4 |
1 | "Really cool, you should go... | 7 | NaN | NaN |
2 | "I had a great time, you.. | 9 | "I'll never try this again... | 2 |
2 | NaN | NaN | "I would not recommend.. | 3 |
3 | "Good way to spend your night... | 8 | "Could have been better... | 4 |
3 | NaN | NaN | "No one should ever go... | 1 |
4 | "I might go again for how good it was... | 9 | NaN | NaN |
I tried `df3 = df1.merge(df2, on='unique_id', how='inner')`, but that just repeats the first review from df1 for every matching review in df2, like this (see unique_id 2 below):
unique_id | pos_review | pos_rating | neg_review | neg_rating |
---|---|---|---|---|
1 | "Great, would recommend... | 8 | "Really boring... | 4 |
1 | "Really cool, you should go... | 7 | NaN | NaN |
2 | "I had a great time, you.. | 9 | "I'll never try this again... | 2 |
2 | "I had a great time, you.. | 9 | "I would not recommend.. | 3 |
3 | "Good way to spend your night... | 8 | "Could have been better... | 4 |
3 | NaN | NaN | "No one should ever go... | 1 |
4 | "I might go again for how good it was... | 9 | NaN | NaN |
Any ideas on how to get the df3 shown above?
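For context, the repetition follows from how a merge on `unique_id` alone works: every left row is paired with every right row that shares the id, and an inner merge also drops ids missing from either side. The sketch below rebuilds the sample frames (review text copied as truncated in the question) and counts the rows produced per id; the names `inner` and `rows_per_id` are only illustrative.

```python
import pandas as pd

# Sample data from the question; review strings kept as truncated there.
df1 = pd.DataFrame({
    "unique_id": [1, 1, 2, 3, 4],
    "pos_review": [
        "Great, would recommend...",
        "Really cool, you should go...",
        "I had a great time, you..",
        "Good way to spend your night...",
        "I might go again for how good it was...",
    ],
    "pos_rating": [8, 7, 9, 8, 9],
})
df2 = pd.DataFrame({
    "unique_id": [1, 2, 2, 3, 3],
    "neg_review": [
        "Really boring...",
        "I'll never try this again...",
        "I would not recommend..",
        "Could have been better...",
        "No one should ever go...",
    ],
    "neg_rating": [4, 2, 3, 4, 1],
})

# Merging on unique_id alone pairs each df1 row with every df2 row of that id.
inner = df1.merge(df2, on="unique_id", how="inner")

# Rows per id = (rows in df1 for that id) * (rows in df2 for that id).
rows_per_id = inner.groupby("unique_id").size()

print(inner)
print(rows_per_id)
```

With this data, id 2 has one positive and two negative reviews, so its single positive review appears on both merged rows, and id 4 disappears because it has no row in df2.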
Change `inner` to `outer`, and use `cumcount` to create a sub-key:
df1['key'] = df1.groupby('unique_id').cumcount()
df2['key'] = df2.groupby('unique_id').cumcount()
df3 = df1.merge(df2,on = ['unique_id','key'],how='outer').sort_values('unique_id')
Out[134]:
unique_id pos_review pos_rating key neg_review neg_rating
0 1 reatwouldrecommend... 8.0 0 Really 4.0
1 1 Really 7.0 1 NaN NaN
2 2 had 9.0 0 I'll 2.0
5 2 NaN NaN 1 I 3.0
3 3 Good 8.0 0 Could 4.0
6 3 NaN NaN 1 No 1.0
4 4 I 9.0 0 NaN NaN
# you can also drop the key column with df3 = df3.drop(['key'],axis=1)
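As an end-to-end illustration of the sub-key idea, here is a minimal sketch on the question's sample data. It assumes the review text exactly as truncated in the question, uses `assign` so the original frames are not modified, and adds the "No review" replacement the question mentions; `left`, `right`, and `df3_filled` are illustrative names, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "unique_id": [1, 1, 2, 3, 4],
    "pos_review": [
        "Great, would recommend...",
        "Really cool, you should go...",
        "I had a great time, you..",
        "Good way to spend your night...",
        "I might go again for how good it was...",
    ],
    "pos_rating": [8, 7, 9, 8, 9],
})
df2 = pd.DataFrame({
    "unique_id": [1, 2, 2, 3, 3],
    "neg_review": [
        "Really boring...",
        "I'll never try this again...",
        "I would not recommend..",
        "Could have been better...",
        "No one should ever go...",
    ],
    "neg_rating": [4, 2, 3, 4, 1],
})

# Number the rows within each unique_id: 0 for the first review, 1 for the next, ...
left = df1.assign(key=df1.groupby("unique_id").cumcount())
right = df2.assign(key=df2.groupby("unique_id").cumcount())

# Outer merge on (unique_id, key): the n-th positive review of an id is paired
# with the n-th negative review of that id, and unmatched rows get NaN.
df3 = (
    left.merge(right, on=["unique_id", "key"], how="outer")
    .sort_values(["unique_id", "key"])
    .drop(columns="key")
    .reset_index(drop=True)
)

# Optional follow-up from the question: label the missing reviews.
df3_filled = df3.fillna({"pos_review": "No review", "neg_review": "No review"})

print(df3_filled)
```

Sorting on both `unique_id` and `key` keeps the reviews of each id in their original order, which sorting on `unique_id` alone does not guarantee.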
You need a cumulative counter, computed per `unique_id` with `groupby(...).cumcount()`, as a second merge key.
df3 = pd.merge(
df1,df2,
left_on=['unique_id',df1.groupby('unique_id').cumcount()],
right_on=['unique_id',df2.groupby('unique_id').cumcount()],
how='outer')
This gives the expected result.
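One detail worth knowing about this one-step form: because the counters are passed as unnamed arrays, the merged frame carries an extra helper column for them. The sketch below, on the question's sample data, drops whatever columns did not come from either input rather than assuming the helper's name; `merged` and `helper_cols` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({
    "unique_id": [1, 1, 2, 3, 4],
    "pos_review": [
        "Great, would recommend...",
        "Really cool, you should go...",
        "I had a great time, you..",
        "Good way to spend your night...",
        "I might go again for how good it was...",
    ],
    "pos_rating": [8, 7, 9, 8, 9],
})
df2 = pd.DataFrame({
    "unique_id": [1, 2, 2, 3, 3],
    "neg_review": [
        "Really boring...",
        "I'll never try this again...",
        "I would not recommend..",
        "Could have been better...",
        "No one should ever go...",
    ],
    "neg_rating": [4, 2, 3, 4, 1],
})

# The per-id counters are passed directly as the second join key on each side.
merged = pd.merge(
    df1,
    df2,
    left_on=["unique_id", df1.groupby("unique_id").cumcount()],
    right_on=["unique_id", df2.groupby("unique_id").cumcount()],
    how="outer",
)

# Drop any column that exists in neither input, i.e. the counter helper column.
known_cols = set(df1.columns) | set(df2.columns)
helper_cols = [col for col in merged.columns if col not in known_cols]
df3 = merged.drop(columns=helper_cols)

print(df3)
```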
Updated after @HenryEcker pointed out that `append` is going to be deprecated.
I would use `pd.concat` instead of `DataFrame.merge`, because 'unique_id' is not actually unique among the table's values.
df3 = pd.concat([df1, df2], ignore_index=True)
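To make the shape of that result concrete, here is a small sketch on the question's sample data (review text as truncated there). It only illustrates what `pd.concat` returns in this case: the rows are stacked, and each row has NaN in the columns that belong to the other frame.

```python
import pandas as pd

df1 = pd.DataFrame({
    "unique_id": [1, 1, 2, 3, 4],
    "pos_review": [
        "Great, would recommend...",
        "Really cool, you should go...",
        "I had a great time, you..",
        "Good way to spend your night...",
        "I might go again for how good it was...",
    ],
    "pos_rating": [8, 7, 9, 8, 9],
})
df2 = pd.DataFrame({
    "unique_id": [1, 2, 2, 3, 3],
    "neg_review": [
        "Really boring...",
        "I'll never try this again...",
        "I would not recommend..",
        "Could have been better...",
        "No one should ever go...",
    ],
    "neg_rating": [4, 2, 3, 4, 1],
})

# Stack the rows; columns missing from one frame are filled with NaN.
df3 = pd.concat([df1, df2], ignore_index=True)

# Optional: group the rows of each id together while keeping their order.
df3_sorted = df3.sort_values("unique_id", kind="stable").reset_index(drop=True)

print(df3_sorted)
```

Unlike the merge-based answers, this yields one row per review (ten rows here), never a positive and a negative review side by side.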
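Maybe `merge` is confusing your sense of what the output table should look like. I think your ideal df3 example needs to include extra rows with NaNs. For example, for unique_id = 1 you should have three rows:
- two with NaN in the negative columns
- one with NaNs in the positive columns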
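I am not sure why you would attach the negative review to only one of the unique_id = 1 rows and not the other. It is better to keep all rows and use NaN wherever appropriate. Then, if you want to aggregate, use `DataFrame.groupby`, for example to get the average rating: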
grouped_mean = df3.groupby('unique_id').mean(numeric_only=True)  # numeric_only skips the text review columns
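Note that this gives you a new dataframe holding the mean of the negative ratings and the mean of the positive ratings separately, because they sit in different columns of df3.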
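A minimal sketch of that aggregation on the question's sample data follows. It selects the two rating columns explicitly, which is an assumption about what should be averaged and avoids asking pandas to average the text columns.

```python
import pandas as pd

df1 = pd.DataFrame({
    "unique_id": [1, 1, 2, 3, 4],
    "pos_review": [
        "Great, would recommend...",
        "Really cool, you should go...",
        "I had a great time, you..",
        "Good way to spend your night...",
        "I might go again for how good it was...",
    ],
    "pos_rating": [8, 7, 9, 8, 9],
})
df2 = pd.DataFrame({
    "unique_id": [1, 2, 2, 3, 3],
    "neg_review": [
        "Really boring...",
        "I'll never try this again...",
        "I would not recommend..",
        "Could have been better...",
        "No one should ever go...",
    ],
    "neg_rating": [4, 2, 3, 4, 1],
})

df3 = pd.concat([df1, df2], ignore_index=True)

# mean() ignores NaN, so each column is averaged over the reviews that exist.
grouped_mean = df3.groupby("unique_id")[["pos_rating", "neg_rating"]].mean()

print(grouped_mean)
```

An id with no negative reviews at all (id 4 here) ends up with NaN as its negative mean.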