Fuzzy matching with pyspark or python
I am trying to do fuzzy matching with pyspark or python, where I have 2 lists.
i. List of standard city values
Clarksburg
Fremont
San Leandro
Albuquerque
Columbus
San Jose
Martinez
New York
Alhambra
Unknown
Las Vegas
Dublin
Niagara Falls
ii. List of misspelled city names
Clarksburg
Closed 10/97
Fre,Nont
Fremong
San L:Eandro
Albuquerue
Clmbs
Sanjse
Martinz
New Yrk
Alambra
00011
L Vegas
Vegas
Ssan jose
Nw Yrk
Colmbus
Klarkburg
Alburque
Dublin
Niegara F
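Now I want to match each misspelled city name against the list of standard values and create another list with the appropriate matches. I am looking for the following output: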
Clarksburg - Clarksburg
Closed 10/97 - Unknown
Fre,Nont - Fremont
Fremong - Fremont
San L:Eandro - San Leandro
Albuquerue - Albuquerque
Clmbs - Columbus
Sanjse - San Jose
Martinz - Martinez
New Yrk - New York
Alambra - Alhambra
00011 - Unknown
L Vegas - Las Vegas
Vegas - Las Vegas
Ssan jose - San Jose
Nw Yrk - New York
Colmbus - Columbus
Klarkburg - Clarksburg
Alburque - Albuquerque
Dublin - Dublin
Niegara F - Niagara Falls
Any help would be appreciated. Thanks in advance.
Use fuzzywuzzy, and change threshold to suit your requirements:
from fuzzywuzzy import process
threshold = 40
matchlist = [x for x in """
Clarksburg
Fremont
San Leandro
Albuquerque
Columbus
San Jose
Martinez
New York
Alhambra
Unknown
Las Vegas
Dublin
Niagara Falls
""".splitlines() if x]
checklist = [x for x in """
Clarksburg
Closed 10/97
Fre,Nont
Fremong
San L:Eandro
Albuquerue
Clmbs
Sanjse
Martinz
New Yrk
Alambra
00011
L Vegas
Vegas
Ssan jose
Nw Yrk
Colmbus
Klarkburg
Alburque
Dublin
Niegara F
""".splitlines() if x]
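# For each misspelled name, take the best-scoring standard value;
# fall back to 'Unknown' when the score does not exceed the threshold.
for check in checklist:
    match = process.extractOne(check, matchlist)
    print(f"{check} - {match[0] if match[1] > threshold else 'Unknown'}")
This gives me: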
Clarksburg - Clarksburg
Closed 10/97 - Unknown
Fre,Nont - Fremont
Fremong - Fremont
San L:Eandro - San Leandro
Albuquerue - Albuquerque
Clmbs - Columbus
Sanjse - San Jose
Martinz - Martinez
New Yrk - New York
Alambra - Alhambra
00011 - Unknown
L Vegas - Las Vegas
Vegas - Las Vegas
Ssan jose - San Jose
Nw Yrk - New York
Colmbus - Columbus
Klarkburg - Clarksburg
Alburque - Albuquerque
Dublin - Dublin
Niegara F - Niagara Falls
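If installing fuzzywuzzy is not an option, the same idea (pick the closest standard value, fall back to 'Unknown' below a cutoff) can be sketched with the standard library's difflib. This is a different scorer from fuzzywuzzy, so the scores and some borderline matches will not be identical to the output above. The function name match_city, the cutoff of 0.6, and the lower-casing step are my own assumptions for illustration, not part of the original answer.

```python
from difflib import get_close_matches

def match_city(name, standards, cutoff=0.6, default="Unknown"):
    """Return the closest standard value for name, or default if none is close enough.

    Hypothetical helper: cutoff is a difflib similarity ratio in [0, 1],
    not a fuzzywuzzy score in [0, 100].
    """
    # Compare case-insensitively, then map back to the standard spelling.
    lookup = {s.lower(): s for s in standards}
    hits = get_close_matches(name.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else default

standards = ["Clarksburg", "Fremont", "San Leandro", "Albuquerque", "Columbus",
             "San Jose", "Martinez", "New York", "Alhambra", "Unknown",
             "Las Vegas", "Dublin", "Niagara Falls"]

for check in ["Fremong", "New Yrk", "00011"]:
    print(f"{check} - {match_city(check, standards)}")
```

As with threshold in the fuzzywuzzy version, cutoff probably needs tuning against your own data: a lower value matches more aggressively, a higher value sends more names to 'Unknown'.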