Python Pandas - Fuzzy duplicates matching
I have a dataframe like this:
    make                model
0   allard              K1
1   alllard             J2
2   alpine renault      A110
3   alpine renualt      A310
4   amc (rambler        American
5   amc (rambler)       Marlin
6   aries               1907
7   ariès               1932
8   austin healey       3000
9   austin-healey       Sprite
10  benjamin et benova  Type B3
11  benjamin/benova     Type P2
12  benjmin/benova      Type P3
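The goal is to add a third column containing the index of the row with the highest fuzzy ratio (the closest fuzzy match).
How can I compare the rows efficiently?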
Using fuzzywuzzy, and assuming the fuzzy matching should be done on the make column, you can try:
import pandas as pd
from itertools import product
from fuzzywuzzy.fuzz import ratio

df = pd.read_csv('data.csv')

# Score every ordered pair of distinct makes (0-100, higher is closer).
keys = list(set(df['make']))
ratios = pd.DataFrame([{'k1': k1, 'k2': k2, 'ratio': ratio(k1, k2)}
                       for k1, k2 in product(keys, keys) if k1 != k2])

def find_closest(make):
    # idxmax returns the row label of the best score, which is what .loc needs.
    best = ratios.loc[ratios['k1'] == make, 'ratio'].idxmax()
    # Index of the first row in df whose make equals the best match.
    return df.index[df['make'] == ratios.loc[best, 'k2']][0]

df['closest_index'] = df['make'].apply(find_closest)
print(df)
Output for your data:
    make                model     closest_index
0   allard              K1        1
1   alllard             J2        0
2   alpine renault      A110      3
3   alpine renualt      A310      2
4   amc (rambler        American  5
5   amc (rambler)       Marlin    4
6   aries               1907      7
7   ariès               1932      6
8   austin healey       3000      9
9   austin-healey       Sprite    8
10  benjamin et benova  Type B3   11
11  benjamin/benova     Type P2   12
12  benjmin/benova      Type P3   11
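If you want to check the idea without installing anything, here is a minimal standard-library sketch of the same closest-match lookup. It is an illustration under assumptions, not the code from the answer: difflib.SequenceMatcher is swapped in for fuzzywuzzy's ratio, so scores may differ slightly, and the names similarity and closest_indices are hypothetical helpers. The makes are typed in by hand instead of being read from data.csv.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity score in [0, 100]; a stand-in for fuzzywuzzy.fuzz.ratio."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def closest_indices(makes):
    """For each position i, return the position of the most similar other entry.

    Every entry is compared with every other entry (quadratic in the number
    of rows). On a tie, the lowest position wins.
    """
    result = []
    for i, make in enumerate(makes):
        candidates = [j for j in range(len(makes)) if j != i]
        result.append(max(candidates, key=lambda j: similarity(make, makes[j])))
    return result


# The 'make' column from the question, in row order.
makes = [
    "allard", "alllard", "alpine renault", "alpine renualt",
    "amc (rambler", "amc (rambler)", "aries", "ariès",
    "austin healey", "austin-healey", "benjamin et benova",
    "benjamin/benova", "benjmin/benova",
]

print(closest_indices(makes))
```

On this data the sketch gives the same indices as the closest_index column above. Unlike the answer's code, it compares rows by position, so two rows with an identical make would match each other instead of being skipped.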