How can I subset a data frame for unique rows using repeating values from a column in another data frame in Python?
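I have two data frames. I want to subset df_1 based on df_2 so that the rows of the resulting data frame correspond to the rows of df_2. Here are two example data frames: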
import pandas as pd

df_1 = pd.DataFrame({
    "ID": ["Lemon","Banana","Apple","Cherry","Tomato","Blueberry","Avocado","Lime"],
    "Color": ["Yellow","Yellow","Red","Red","Red","Blue","Green","Green"]})
df_2 = pd.DataFrame({"Color": ["Red","Blue","Yellow","Green","Red","Yellow"]})
My desired output is df_3, whose "Color" column is identical to the one in df_2:
df_3 = pd.DataFrame({
    "ID": ["Apple","Blueberry","Lemon","Avocado","Cherry","Banana"],
    "Color": ["Red","Blue","Yellow","Green","Red","Yellow"]})
When I merge df_1 and df_2, I get duplicate rows, because most rows in df_2 have more than one match in df_1.
merged = df_2.merge(df_1, how="left", on="Color")
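Dropping duplicates works for "Yellow", because its occurrences in df_2 and its options in df_1 are in a 2:2 ratio, but it does not work for "Red" or "Green", which have ratios of 2:3 and 1:2 respectively and so leave extra rows.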
no_duplicates = merged.drop_duplicates(subset="ID")
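To make the row counts concrete, here is a small self-contained sketch that repeats the two steps above on the example frames and counts the rows at each stage. It only uses the data and calls already shown; the per-color breakdown is added for illustration.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "ID": ["Lemon", "Banana", "Apple", "Cherry", "Tomato", "Blueberry", "Avocado", "Lime"],
    "Color": ["Yellow", "Yellow", "Red", "Red", "Red", "Blue", "Green", "Green"]})
df_2 = pd.DataFrame({"Color": ["Red", "Blue", "Yellow", "Green", "Red", "Yellow"]})

# Each df_2 row is paired with every df_1 row of the same color.
merged = df_2.merge(df_1, how="left", on="Color")
rows_per_color = merged.groupby("Color").size().to_dict()
print(rows_per_color)  # {'Blue': 1, 'Green': 2, 'Red': 6, 'Yellow': 4}
print(len(merged))     # 13

# Dropping duplicate IDs keeps every fruit once, so 8 rows remain instead of 6.
no_duplicates = merged.drop_duplicates(subset="ID")
print(len(no_duplicates))  # 8
```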
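Is there a way to subset df_1 so that the first "Red" in df_2 pulls the first "Red" in df_1, the second "Red" in df_2 pulls the second "Red" in df_1, and so on? I would rather not use a loop unless there is no other option. Thanks.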
Try adding an indicator column to both df_1 and df_2, using groupby cumcount to get the position of each row within its Color group:
df_1['i'] = df_1.groupby('Color').cumcount()
df_2['i'] = df_2.groupby('Color').cumcount()
df_1:
          ID   Color  i
0      Lemon  Yellow  0
1     Banana  Yellow  1
2      Apple     Red  0
3     Cherry     Red  1
4     Tomato     Red  2
5  Blueberry    Blue  0
6    Avocado   Green  0
7       Lime   Green  1
df_2:
    Color  i
0     Red  0
1    Blue  0
2  Yellow  0
3   Green  0
4     Red  1
5  Yellow  1
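The reason the indicator helps is that the pair (Color, i) identifies at most one row in each frame, so a merge on both columns can no longer fan out. A minimal sketch to check this on the example data; it uses assign so the original frames are not modified, and the names keys_1 and keys_2 are only illustrative:

```python
import pandas as pd

df_1 = pd.DataFrame({
    "ID": ["Lemon", "Banana", "Apple", "Cherry", "Tomato", "Blueberry", "Avocado", "Lime"],
    "Color": ["Yellow", "Yellow", "Red", "Red", "Red", "Blue", "Green", "Green"]})
df_2 = pd.DataFrame({"Color": ["Red", "Blue", "Yellow", "Green", "Red", "Yellow"]})

# cumcount numbers the rows of each Color group 0, 1, 2, ... in order of appearance.
keys_1 = df_1.assign(i=df_1.groupby("Color").cumcount())
keys_2 = df_2.assign(i=df_2.groupby("Color").cumcount())
print(keys_2["i"].tolist())  # [0, 0, 0, 0, 1, 1]

# No (Color, i) pair repeats in either frame.
dup_1 = bool(keys_1.duplicated(["Color", "i"]).any())
dup_2 = bool(keys_2.duplicated(["Color", "i"]).any())
print(dup_1, dup_2)  # False False
```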
Then merge on both the indicator and the Color, and drop the indicator column:
merged_df = df_1.merge(df_2, how='right', on=['Color', 'i']).drop('i', axis=1)
merged_df:
          ID   Color
0      Apple     Red
1  Blueberry    Blue
2      Lemon  Yellow
3    Avocado   Green
4     Cherry     Red
5     Banana  Yellow
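Alternatively, pass the Series directly to merge (this leaves df_1 and df_2 unmodified). The unnamed cumcount key appears in the result as key_1, so that column is dropped afterwards: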
merged_df = df_1.merge(
    df_2, how='right',
    left_on=['Color', df_1.groupby('Color').cumcount()],
    right_on=['Color', df_2.groupby('Color').cumcount()]
).drop('key_1', axis=1)
merged_df:
          ID   Color
0      Apple     Red
1  Blueberry    Blue
2      Lemon  Yellow
3    Avocado   Green
4     Cherry     Red
5     Banana  Yellow
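If this is needed in more than one place, the same idea can be wrapped in a small function. The sketch below is one possible packaging, not part of the answer above: the function name match_by_occurrence and the helper column name _occurrence are hypothetical, and the helper name is assumed not to clash with a real column. It uses assign instead of in-place columns, and validate="one_to_one" asks pandas to raise if the merge keys are ever not unique.

```python
import pandas as pd

def match_by_occurrence(source, order, key="Color"):
    """Return rows of `source` so that the n-th `key` value in `order`
    picks the n-th row of `source` with that value."""
    helper = "_occurrence"  # hypothetical helper column, assumed unused
    left = source.assign(**{helper: source.groupby(key).cumcount()})
    right = order.assign(**{helper: order.groupby(key).cumcount()})
    merged = left.merge(right, how="right", on=[key, helper],
                        validate="one_to_one")
    return merged.drop(columns=helper)

df_1 = pd.DataFrame({
    "ID": ["Lemon", "Banana", "Apple", "Cherry", "Tomato", "Blueberry", "Avocado", "Lime"],
    "Color": ["Yellow", "Yellow", "Red", "Red", "Red", "Blue", "Green", "Green"]})
df_2 = pd.DataFrame({"Color": ["Red", "Blue", "Yellow", "Green", "Red", "Yellow"]})

result = match_by_occurrence(df_1, df_2)
print(result["ID"].tolist())
# ['Apple', 'Blueberry', 'Lemon', 'Avocado', 'Cherry', 'Banana']
```

Because this is a right merge, a row in df_2 that asks for more occurrences of a color than df_1 contains would come back with a missing ID rather than being dropped.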