Fuzzy matching and grouping
I am trying to do fuzzy matching and grouping in Python on multiple fields. I want to compare each column at a different fuzzy threshold. I tried searching Google but couldn't find any solution that performs the deduplication and then creates groups across the different columns.
Input:

Name | Address |
---|---|
Robert | 9185 Pumpkin Hill St. |
Rob | 9185 Pumpkin Hill Street |
Mike | 1296 Tunnel St. |
Mike | Tunnel Street 1296 |
John | 6200 Beechwood Drive |
Output:

Group ID | Name | Address |
---|---|---|
1 | Robert | 9185 Pumpkin Hill St. |
1 | Rob | 9185 Pumpkin Hill Street |
2 | Mike | 1296 Tunnel St. |
2 | Mike | Tunnel Street 1296 |
3 | John | 6200 Beechwood Drive |
I'd suggest looking at Levenshtein distance, as that is a commonly used algorithm for identifying similar strings. The library FuzzyWuzzy (silly name, I know) implements it in 3 different ways. See this article for more information.

Here is a starting point that compares every string with every other string. You mentioned having different thresholds, so all that's left to do is loop through l_match and group the rows according to the thresholds you want (a sketch of that grouping step follows the results below).
#Run this to install the required libraries
#pip install python-levenshtein fuzzywuzzy

from fuzzywuzzy import fuzz

l_data = [
     ['Robert','9185 Pumpkin Hill St.']
    ,['Rob','9185 Pumpkin Hill Street']
    ,['Mike','1296 Tunnel St.']
    ,['Mike','Tunnel Street 1296']
    ,['John','6200 Beechwood Drive']
]

l_match = []

#loop through data
for idx1,row1 in enumerate(l_data):
    #compare each person with every person that comes after them in the list (so each pair is compared only once instead of both A vs B and B vs A)
    for idx2,row2 in enumerate(l_data[idx1+1:]):
        #calculate the index of row2 in the original list
        origIdx = idx1+idx2+1
        l_match.append([idx1,origIdx,fuzz.ratio(row1[0],row2[0]),fuzz.ratio(row1[1],row2[1])])

#Print raw data with index
for idx,val in enumerate(l_data):
    print(f'{idx}-{val}')

print("*" * 100)

#Print results of comparison
for row in l_match:
    id1 = row[0]
    id2 = row[1]
    formattedName1 = f'{id1}-{l_data[id1][0]}'
    formattedName2 = f'{id2}-{l_data[id2][0]}'
    print(f'{formattedName1} and {formattedName2} have {row[2]}% name similarity ratio and {row[3]}% address similarity ratio')
Result:
0-['Robert', '9185 Pumpkin Hill St.']
1-['Rob', '9185 Pumpkin Hill Street']
2-['Mike', '1296 Tunnel St.']
3-['Mike', 'Tunnel Street 1296']
4-['John', '6200 Beechwood Drive']
****************************************************************************************************
0-Robert and 1-Rob have 67% name similarity ratio and 89% address similarity ratio
0-Robert and 2-Mike have 20% name similarity ratio and 50% address similarity ratio
0-Robert and 3-Mike have 20% name similarity ratio and 31% address similarity ratio
0-Robert and 4-John have 20% name similarity ratio and 15% address similarity ratio
1-Rob and 2-Mike have 0% name similarity ratio and 41% address similarity ratio
1-Rob and 3-Mike have 0% name similarity ratio and 48% address similarity ratio
1-Rob and 4-John have 29% name similarity ratio and 18% address similarity ratio
2-Mike and 3-Mike have 100% name similarity ratio and 55% address similarity ratio
2-Mike and 4-John have 0% name similarity ratio and 23% address similarity ratio
3-Mike and 4-John have 0% name similarity ratio and 21% address similarity ratio
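To turn those pairwise ratios into the Group IDs from the question, one possible rule (my own assumption, not part of the answer above) is to link two rows whenever either column clears its threshold and let linked rows chain together into groups. A minimal sketch that continues from the `l_data` and `l_match` lists above, with hypothetical thresholds of 85 for both columns:

```python
# Hypothetical per-column thresholds -- tune these to your data
NAME_THRESHOLD = 85
ADDRESS_THRESHOLD = 85

# Union-find parents, one entry per row of l_data, so matches chain together
# transitively (if A matches B and B matches C, all three land in one group)
parent = list(range(len(l_data)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

# Assumed rule: two rows are duplicates if EITHER column clears its threshold;
# swap `or` for `and` if you want both columns to agree
for id1, id2, name_ratio, addr_ratio in l_match:
    if name_ratio >= NAME_THRESHOLD or addr_ratio >= ADDRESS_THRESHOLD:
        union(id1, id2)

# Assign consecutive group IDs in order of first appearance and print the rows
group_ids = {}
for idx, (name, address) in enumerate(l_data):
    root = find(idx)
    group_id = group_ids.setdefault(root, len(group_ids) + 1)
    print(group_id, name, address)
```

With these example thresholds the sketch prints the three groups from the expected output (1 for Robert/Rob, 2 for the two Mikes, 3 for John), but whether `or` or `and` is right, and which threshold values to use, depends entirely on your data.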
Stephan explained the code well, so there is no need for me to explain it again. You could also try fuzz.partial_ratio; it can give some interesting results.
from thefuzz import fuzz
print(fuzz.ratio("Turkey is the best country", "Turkey is the best country!"))
#98
print(fuzz.partial_ratio("Turkey is the best country", "Turkey is the best country!"))
#100
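In this data, `fuzz.ratio` scores the two Mike addresses at only 55% because the words appear in a different order. A small addition of my own: `fuzz.token_sort_ratio` (also available in `thefuzz`/`fuzzywuzzy`) sorts the words before comparing, which can help with reordered addresses like these:

```python
from thefuzz import fuzz

# Plain ratio is low because the tokens appear in a different order
print(fuzz.ratio("1296 Tunnel St.", "Tunnel Street 1296"))             # 55
# token_sort_ratio sorts the words first, so reordered addresses score higher
print(fuzz.token_sort_ratio("1296 Tunnel St.", "Tunnel Street 1296"))
```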