If-and statement between two pandas DataFrames
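I have two datasets. Using the data in df1, I want to identify duplicate rows in df2 based on four conditions.
- Conditions:
A row in the df1 'Name' column matches any row in the df2 'Name' column by more than 80%
(AND)
(df1['Class'] == df2['Class'] (OR) df1['Amt $'] == df2['Amt $'])
(AND)
A row in the df1 'Category' column matches any row in the df2 'Category' column by more than 80%
- Result:
If all conditions are met, keep only the new data in df2 and drop the other rows.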
df1
Name    Class  Amt $  Category
Apple   1      5      Fruit
Banana  2      8      Fruit
Cat     3      4      Animal
df2
Index  Name              Class  Amt $  Category
1      Apple is Red      1      5      Fruit
2      Banana            2      8      fruits
3      Cat is cute       3      4      animals
4      Green Apple       1      5      fruis
5      Banana is Yellow  2      8      fruet
6      Cat               3      4      anemal
7      Apple             1      5      anemal
8      Ripe Banana       2      8      frut
9      Royal Gala Apple  1      5      Fruit
10     Cats              3      4      animol
11     Green Banana      2      8      Fruit
12     Green Apple       1      5      fruits
13     White Cat         3      4      Animal
14     Banana is sweet   2      8      appel
15     Apple is Red      1      5      fruits
16     Ginger Cat        3      4      fruits
17     Cat house         3      4      animals
18     Royal Gala Apple  1      5      fret
19     Banana is Yellow  2      8      fruit market
20     Cat is cute       3      4      anemal
- Code I tried:
from difflib import SequenceMatcher

for i in df1['Name']:
    for u in df2['Name']:
        for k in df1['Class']:
            for l in df2['Class']:
                for m in df1['Amt $']:
                    for n in df2['Amt $']:
                        for o in df1['Category']:
                            for p in df2['Category']:
                                if SequenceMatcher(None, i, u).ratio() > .8 and k == l and m == n and SequenceMatcher(None, o, p).ratio() > 0.8:
                                    print(i, u)
The desired output DataFrame should look like this:
Name              Class  Amt $  Category
Apple is Red      1      5      Fruit
Banana            2      8      fruits
Cat is cute       3      4      animals
Green Apple       1      5      fruis
Banana is Yellow  2      8      fruet
Cat               3      4      anemal
Ripe Banana       2      8      frut
Royal Gala Apple  1      5      Fruit
Cats              3      4      animol
Green Banana      2      8      Fruit
Green Apple       1      5      fruits
White Cat         3      4      Animal
Apple is Red      1      5      fruits
Cat house         3      4      animals
Banana is Yellow  2      8      fruit market
Cat is cute       3      4      anemal
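Please help me find the best solution! :)
First, you have to iterate over both DataFrames, test each pair of rows against the conditions, and record the result in a new column in df2.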
from difflib import SequenceMatcher

df2['match'] = False
for idx2, row2 in df2.iterrows():
    match = False
    # Compare this df2 row against every df1 row until one satisfies all conditions.
    for idx1, row1 in df1.iterrows():
        if (SequenceMatcher(None, row1['Name'], row2['Name']).ratio() >= 0.8 and
                SequenceMatcher(None, row1['Category'], row2['Category']).ratio() >= 0.8 and
                (row1['Class'] == row2['Class'] or row1['Amt $'] == row2['Amt $'])):
            match = True
            break
    df2.at[idx2, 'match'] = match
Once you have the matches, you can drop the duplicates among the matched rows (df2['match']==True).
df2[df2['match']==True].drop_duplicates(keep='first')
Next, you can combine that result with the non-matching rows (df2['match']==False).
pd.concat([df2[df2['match']==False], df2[df2['match']==True].drop_duplicates(keep='first')])
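Here I am assuming you want to drop exact duplicates. Do you want to de-duplicate based on the conditions, or just drop exact duplicates?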
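To make the difference between the two kinds of de-duplication concrete, here is a small sketch. The three-row frame and the choice of `subset` columns are made up for illustration and are not taken from your data.

```python
import pandas as pd

# Hypothetical mini-frame: rows 0 and 2 are identical in every column,
# row 1 differs only in 'Category'.
df = pd.DataFrame({
    "Name": ["Apple is Red", "Apple is Red", "Apple is Red"],
    "Class": [1, 1, 1],
    "Category": ["Fruit", "fruits", "Fruit"],
})

# Exact de-duplication: a row is dropped only if all columns repeat.
exact = df.drop_duplicates(keep="first")

# De-duplication on chosen columns only (the column choice is an assumption).
by_name = df.drop_duplicates(subset=["Name", "Class"], keep="first")

print(exact)
print(by_name)
```

With exact de-duplication, rows 0 and 1 survive; restricting the comparison to 'Name' and 'Class' leaves only row 0.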
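According to your test data, 'Apple' and 'Apple is Red' are supposed to be an 80% match. But SequenceMatcher(None, 'Apple', 'Apple is Red').ratio() gives only 0.5882352941176471. Similarly, SequenceMatcher(None, 'Fruit', 'fruits').ratio() is only 0.7272727272727273. Are you expecting something else here, or is the expected result wrong?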
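The ratios quoted above can be reproduced with the short sketch below. The helper name `ratio` is mine, and the lowercased comparison in the last line is only an assumption about what you might want, since SequenceMatcher treats 'F' and 'f' as different characters.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] as computed by difflib.SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


name_ratio = ratio("Apple", "Apple is Red")   # 2 * 5 / 17
category_ratio = ratio("Fruit", "fruits")     # 2 * 4 / 11

# Assumption: lowercase both sides first so that case does not count as a difference.
category_ratio_ci = ratio("Fruit".lower(), "fruits".lower())  # 2 * 5 / 11

print(name_ratio, category_ratio, category_ratio_ci)
```

Ignoring case lifts the category comparison above 0.8, but the name comparison stays near 0.59, so a 0.8 threshold on 'Name' would still reject 'Apple is Red'.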
Anyway, I hope this gives you an idea of the approach.
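Edit 1: if you want to get the matching df1['Name'].
I just changed df2['match'] to hold a string instead of a boolean, and assign df1['Name'] to df2['match'] instead of True. Then, for the final DataFrame, I concatenate the rows of df2 that have no match (df2['match']=='') with the de-duplicated rows that do have one (df2['match']!=''). Hope this helps.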
from difflib import SequenceMatcher
import pandas as pd

df2['match'] = ''
for idx2, row2 in df2.iterrows():
    match = ''
    # Store the first df1 name that satisfies all conditions (threshold lowered to 0.5).
    for idx1, row1 in df1.iterrows():
        if (SequenceMatcher(None, row1['Name'], row2['Name']).ratio() >= 0.5 and
                SequenceMatcher(None, row1['Category'], row2['Category']).ratio() >= 0.5 and
                (row1['Class'] == row2['Class'] or row1['Amt $'] == row2['Amt $'])):
            match = row1['Name']
            break
    df2.at[idx2, 'match'] = match
# Unmatched rows first, then the matched rows with exact duplicates removed.
print(pd.concat([df2[df2['match']==''], df2[df2['match']!=''].drop_duplicates(keep='first')]))
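For reference, here is one way the same approach might be packaged as a self-contained script. The helper names `similarity` and `find_match`, the lowercasing step, and the five-row subset of df2 are my own choices for illustration; the 0.5 threshold is the one used in Edit 1. Treat it as a sketch rather than a finished solution.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a: str, b: str) -> float:
    # Assumption: compare case-insensitively.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_match(row2: pd.Series, df1: pd.DataFrame, threshold: float) -> str:
    """Return the first df1 'Name' that satisfies all conditions, else ''."""
    for _, row1 in df1.iterrows():
        if (similarity(row1["Name"], row2["Name"]) >= threshold
                and similarity(row1["Category"], row2["Category"]) >= threshold
                and (row1["Class"] == row2["Class"]
                     or row1["Amt $"] == row2["Amt $"])):
            return row1["Name"]
    return ""


df1 = pd.DataFrame({
    "Name": ["Apple", "Banana", "Cat"],
    "Class": [1, 2, 3],
    "Amt $": [5, 8, 4],
    "Category": ["Fruit", "Fruit", "Animal"],
})

# A small subset of the question's df2, plus one repeated row.
df2 = pd.DataFrame({
    "Name": ["Apple is Red", "Banana", "Apple", "Banana is sweet", "Apple is Red"],
    "Class": [1, 2, 1, 2, 1],
    "Amt $": [5, 8, 5, 8, 5],
    "Category": ["Fruit", "fruits", "anemal", "appel", "Fruit"],
})

df2["match"] = df2.apply(find_match, axis=1, args=(df1, 0.5))

unmatched = df2[df2["match"] == ""]
matched = df2[df2["match"] != ""].drop_duplicates(keep="first")
result = pd.concat([unmatched, matched])
print(result)
```

Passing the threshold as an argument makes it easy to try 0.8 against 0.5 and see which rows stop matching.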