Deleting duplicates based on fuzzy matching in pandas
I have a DataFrame with people's information, but it contains duplicate rows whose addresses differ slightly.
How can I drop duplicates based on fuzzy matching (or some other way of detecting similarity), while making sure a row with a similar address is only dropped when the first and last name also match?
Sample data:
First name | Last name | Address
0 John Doe ABC 9
1 John Doe KFT 2
2 Michael John ABC 9
3 Mary Jane PEP 9/2
4 Mary Jane PEP, 9-2
5 Gary Young verylongstreetname 1
6 Gary Young 1 verylongstretname
(typo is intentional)
Code for the sample data:
import pandas as pd

df = pd.DataFrame([
['John', 'Doe', 'ABC 9'],
['John', 'Doe', 'KFT 2'],
['Michael', 'John', 'ABC 9'],
['Mary', 'Jane', 'PEP 9/2'],
['Mary', 'Jane', 'PEP, 9-2'],
['Gary', 'Young', 'verylongstreetname 1'],
['Gary', 'Young', '1 verylongstretname']
], columns=['First name', 'Last name', 'Address'])
Expected output:
First name | Last name | Address
0 John Doe ABC 9
1 John Doe KFT 2
2 Michael John ABC 9
3 Mary Jane PEP 9/2
4 Gary Young verylongstreetname 1
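Note that a plain drop_duplicates does nothing here, since every address string is textually unique:

```python
import pandas as pd

df = pd.DataFrame([
    ['John', 'Doe', 'ABC 9'],
    ['John', 'Doe', 'KFT 2'],
    ['Michael', 'John', 'ABC 9'],
    ['Mary', 'Jane', 'PEP 9/2'],
    ['Mary', 'Jane', 'PEP, 9-2'],
    ['Gary', 'Young', 'verylongstreetname 1'],
    ['Gary', 'Young', '1 verylongstretname']
], columns=['First name', 'Last name', 'Address'])

# Every row differs in at least the Address column, so nothing is dropped
print(len(df.drop_duplicates()))  # 7
```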
Using str.replace
Save the original addresses, strip all non-word characters, then drop_duplicates:
temp_address = df['Address'].copy()  # keep the originals so they can be restored later
df['Address'] = df['Address'].str.replace(r'\W', '', regex=True)
df.drop_duplicates(inplace=True)
Output
First name Last name Address
0 John Doe ABC9
1 John Doe KFT2
2 Michael John ABC9
3 Mary Jane PEP92
Restore the original addresses:
df['Address'] = df['Address'].apply(lambda x: [w for w in temp_address if w.split(' ')[0] in x][0])
Output
First name Last name Address
0 John Doe ABC 9
1 John Doe KFT 2
2 Michael John ABC 9
3 Mary Jane PEP 9/2
OK, here is one approach:
df['Address'] = df['Address'].str.replace(r'\W', ' ', regex=True)  # replace punctuation with a space

def check_simi(d):
    temp = []
    for w in d:
        temp.extend(w.split(' '))
    temp = [t for t in temp if t]
    flag = len(temp) / 2
    # if every token appears twice across the group, the two addresses are the
    # same up to punctuation/order -> mark the first row of the group for dropping
    if len(set(temp)) == flag:
        return int(d.index[0])
    else:
        return -1

indexes = df.groupby(['First name', 'Last name'])['Address'].apply(check_simi)
indexes = [int(i) for i in indexes if i >= 0]
df.drop(indexes)
First name Last name Address
0 John Doe ABC 9
1 John Doe KFT 2
2 Michael John ABC 9
4 Mary Jane PEP 9 2
6 Gary Young 1 verylongstreetname
PS - have a look at https://github.com/seatgeek/fuzzywuzzy for a cleaner approach; I didn't use it because my network wouldn't allow it.
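If installing fuzzywuzzy is not an option, its token_sort_ratio can be roughly approximated with the standard library's difflib. This is only a sketch of the idea, not fuzzywuzzy's exact scoring (which is Levenshtein-based and also strips punctuation); the function name is my own:

```python
from difflib import SequenceMatcher

def approx_token_sort_ratio(a, b):
    # Sort each string's whitespace-separated tokens, rejoin, and compare.
    # SequenceMatcher.ratio() is similar in spirit to fuzzywuzzy's scoring,
    # though the numbers will not match fuzzywuzzy's exactly.
    sa = ' '.join(sorted(a.lower().split()))
    sb = ' '.join(sorted(b.lower().split()))
    return int(round(100 * SequenceMatcher(None, sa, sb).ratio()))

print(approx_token_sort_ratio('PEP 9/2', 'PEP, 9-2'))                          # similar: high score
print(approx_token_sort_ratio('verylongstreetname 1', '1 verylongstretname'))  # typo + reorder: still high
print(approx_token_sort_ratio('ABC 9', 'KFT 2'))                               # unrelated: low score
```

Sorting the tokens first is what makes '1 verylongstretname' and 'verylongstreetname 1' score as near-duplicates despite the reordering.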
Solved.
Based on @iamklaus's answer I wrote this code:
from fuzzywuzzy import fuzz

def remove_duplicates_inplace(df, groupby=[], similarity_field='', similar_level=85):
    def check_simi(d):
        # compare every pair of values in the group; collect the index of the
        # later row whenever the pair scores at or above similar_level
        dupl_indexes = []
        for i in range(len(d.values) - 1):
            for j in range(i + 1, len(d.values)):
                if fuzz.token_sort_ratio(d.values[i], d.values[j]) >= similar_level:
                    dupl_indexes.append(d.index[j])
        return dupl_indexes

    indexes = df.groupby(groupby)[similarity_field].apply(check_simi)
    for index_list in indexes:
        df.drop(index_list, inplace=True)

remove_duplicates_inplace(df, groupby=['firstname', 'lastname'], similarity_field='address')
Output:
firstname lastname address
0 John Doe ABC 9
1 John Doe KFT 2
2 Michael John ABC 9
3 Mary Jane PEP 9/2
5 Gary Young verylongstreetname 1