Matching a list of names to a column with bad data quality (Python)
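I have a table with 5 columns, one of which is a list of names with very poor data quality. I managed to clean it up as much as possible in R, but it still looks like this (formatted as code for readability):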
Neville Longbottomx
Severus Snape Slyth
Granger, Hermioone
Miss Lovegoo
Nott: Theodore
Mr Potter Gryffindor
Malfoy, Draco
Bulstrode, Millicent
McGonagall, Minerv
Seamus Finnigan Mister
Miss Abbott, Hannah
Ernie Macmillan M
Dumbledore, Albus
Parkinson, Pans" Slyth
Now, I have a second list with names like this:
Lovegood, Luna
Longbottom, Neville
Macmillan, Ernie
Nott, Theodore
Parkinson, Pansy
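I want to find the names from the second list in the first list. I read several articles on this and tried this method, because ngrams seemed like a smart approach, but I first ran into this error: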
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def ngrams(string, n=3):
    # Strip punctuation, then slide a window of n characters over the string
    string = re.sub(r'[,-./]|\sBD',r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]

company_names = names['NAMECOLUMN']
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(company_names)
Traceback (most recent call last):
  File "<ipython-input-4-687c2896bcf2>", line 17, in <module>
    tf_idf_matrix = vectorizer.fit_transform(company_names)
  File "C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1305, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File "<ipython-input-4-687c2896bcf2>", line 10, in ngrams
    string = re.sub(r'[,-./]|\sBD',r'', string)
  File "C:\Program Files\Anaconda3\lib\re.py", line 182, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
After trying to pass it as a string:
ValueError: empty vocabulary; perhaps the documents only contain stop words
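A note on the two errors, as a sketch rather than a diagnosis. re.sub raises this TypeError whenever it is handed something that is not a string, so one plausible cause (an assumption, since the post does not show the data) is that NAMECOLUMN contains missing values such as NaN. The "empty vocabulary" error is what scikit-learn reports when the analyzer produces no features for any document; a string shorter than n, for example, yields no n-grams. The helper clean_names below is hypothetical, and the sample list only stands in for the real column:

```python
import re

def ngrams(string, n=3):
    """Split a name into overlapping character n-grams."""
    # str() guards against non-string input; punctuation is dropped first
    string = re.sub(r'[,\-./:"]', '', str(string))
    return [string[i:i + n] for i in range(len(string) - n + 1)]

def clean_names(values):
    """Hypothetical helper: keep non-empty strings, drop None, NaN and other non-text."""
    return [v.strip() for v in values if isinstance(v, str) and v.strip()]

# Stand-in for the real column, with the kind of gaps that trigger the TypeError
raw = ['Nott: Theodore', float('nan'), None, 'Malfoy, Draco', '']
docs = clean_names(raw)

print(docs)             # ['Nott: Theodore', 'Malfoy, Draco']
print(ngrams(docs[0]))  # trigrams of 'Nott Theodore'
print(ngrams('ab'))     # [] because the string is shorter than n
```

With pandas, names['NAMECOLUMN'].dropna().astype(str) should play the same role as clean_names before the column is passed to the vectorizer.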
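I'm not even sure I'm heading in the right direction, but that link was the best I could find for what I need to do, and I don't know of a better approach. Being a complete Python noob doesn't help either :( so I hope you'll be patient with me.
Any advice and/or code on how to approach this would be greatly appreciated.
Thanks in advance!!
EDIT: I completely forgot to mention that the ideal solution would match the names and return the full rows from my ugly table, since I need the information stored in the other columns.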
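I'd suggest having a look at the fuzzywuzzy package for this kind of matching. Depending on your needs, I think filtering for names whose fuzz.token_sort_ratio or fuzz.token_set_ratio score is above a certain threshold (say 75%) should be enough: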
>>> from fuzzywuzzy import fuzz
>>> from itertools import takewhile
>>>
>>> lstA = ['Neville Longbottomx', 'Severus Snape Slyth', 'Granger, Hermioone', 'Miss Lovegoo', 'Nott: Theodore', 'Mr Potter Gryffindor', 'Malfoy, Draco', 'Bulstrode, Millicent', 'McGonagall, Minerv', 'Seamus Finnigan Mister', 'Miss Abbott, Hannah', 'Ernie Macmillan M', 'Dumbledore, Albus', 'Parkinson, Pans" Slyth']
>>> lstB = ['Lovegood, Luna', 'Longbottom, Neville', 'Macmillan, Ernie', 'Nott, Theodore', 'Parkinson, Pansy']
>>>
>>> dict((name,next(takewhile(lambda n: fuzz.token_sort_ratio(n, name)>75, lstA), '')) for name in lstB)
{'Lovegood, Luna': '', 'Longbottom, Neville': 'Neville Longbottomx', 'Macmillan, Ernie': '', 'Nott, Theodore': '', 'Parkinson, Pansy': ''}
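One thing to watch in the snippet above: itertools.takewhile stops at the first element that fails the test, so it only ever returns a match when that match happens to be the first entry of lstA. That is why only 'Longbottom, Neville' gets a result in the output. A minimal sketch that scans the whole list and keeps the highest-scoring candidate is shown here. To keep it runnable without extra packages, token_sort_score is a stand-in built on difflib (an assumption, not fuzzywuzzy's implementation, so scores will differ somewhat); with fuzzywuzzy installed, fuzz.token_sort_ratio can be passed as scorer instead. best_match is likewise a hypothetical name.

```python
import re
from difflib import SequenceMatcher

def token_sort_score(a, b):
    """Stand-in scorer (0-100): compare names after lowercasing and sorting their words."""
    def norm(s):
        return ' '.join(sorted(re.findall(r'\w+', s.lower())))
    return round(100 * SequenceMatcher(None, norm(a), norm(b)).ratio())

def best_match(name, candidates, scorer=token_sort_score, threshold=75):
    """Return the highest-scoring candidate, or '' if none reaches the threshold."""
    best = max(candidates, key=lambda c: scorer(c, name))
    return best if scorer(best, name) >= threshold else ''

lstA = ['Neville Longbottomx', 'Severus Snape Slyth', 'Granger, Hermioone', 'Miss Lovegoo',
        'Nott: Theodore', 'Mr Potter Gryffindor', 'Malfoy, Draco', 'Bulstrode, Millicent',
        'McGonagall, Minerv', 'Seamus Finnigan Mister', 'Miss Abbott, Hannah',
        'Ernie Macmillan M', 'Dumbledore, Albus', 'Parkinson, Pans" Slyth']
lstB = ['Lovegood, Luna', 'Longbottom, Neville', 'Macmillan, Ernie', 'Nott, Theodore',
        'Parkinson, Pansy']

result = {name: best_match(name, lstA) for name in lstB}
for wanted, found in result.items():
    print(wanted, '->', repr(found))
```

Names that are heavily truncated or padded with titles may still fall under the threshold, so it is worth printing the scores and tuning the threshold on a sample before trusting the result.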
You can use a fuzzy matching algorithm :)
from fuzzywuzzy import fuzz

# Messy names from the table
a = ['Neville Longbottomx','Severus Snape Slyth','Granger, Hermioone','Miss Lovegoo',
     'Nott: Theodore','Mr Potter Gryffindor','Malfoy, Draco','Bulstrode, Millicent',
     'McGonagall, Minerv','Seamus Finnigan Mister','Miss Abbott, Hannah','Ernie Macmillan M',
     'Dumbledore, Albus','Parkinson, Pans" Slyth']
# Clean reference names to look for
b = ['Lovegood, Luna','Longbottom, Neville','Macmillan, Ernie','Nott, Theodore','Parkinson, Pansy']

get_match_a = []  # entries of a that resemble at least one name in b
for name1 in b:
    for name2 in a:
        if fuzz.partial_ratio(name2,name1)>50: # Tune this to fit your need
            get_match_a.append(name2)
            #print(name1,':',name2,'||',fuzz.partial_ratio(name2,name1))
            #uncomment above to see the matching
It works well on the sample lists. I hope this gets you where you want to go :)
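Since the question's edit asks for the full rows rather than just the names, here is a rough sketch of carrying the other columns along with each match. The table is represented as a list of dicts, and the column names 'name' and 'id' are made up for illustration; score is a difflib-based stand-in (an assumption) so the example runs without fuzzywuzzy, and any fuzz.* scorer could take its place.

```python
import re
from difflib import SequenceMatcher

def score(a, b):
    """Stand-in similarity (0-100) on lowercased, sorted word tokens."""
    def norm(s):
        return ' '.join(sorted(re.findall(r'\w+', s.lower())))
    return round(100 * SequenceMatcher(None, norm(a), norm(b)).ratio())

# Hypothetical table: 'name' is the messy column, 'id' stands in for the other columns
rows = [
    {'id': 1, 'name': 'Neville Longbottomx'},
    {'id': 2, 'name': 'Nott: Theodore'},
    {'id': 3, 'name': 'Malfoy, Draco'},
]
wanted = ['Longbottom, Neville', 'Nott, Theodore']

matches = []
for row in rows:
    # Find the reference name closest to this row's messy name
    best = max(wanted, key=lambda w: score(row['name'], w))
    s = score(row['name'], best)
    if s >= 75:  # threshold is a guess; tune it on real data
        matches.append({'matched_to': best, 'score': s, **row})

for m in matches:
    print(m)
```

Keeping the score next to each row makes it easy to review borderline matches by hand. With a pandas DataFrame, the same idea would be to collect the matched names and then select rows with something like names[names['NAMECOLUMN'].isin(matched)].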