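Python fuzzywuzzy error: expected string or buffer
I am using fuzzywuzzy to find near matches in a CSV of company names. I am comparing manually matched strings against unmatched strings, hoping to find some useful near matches, but I am getting a "string or buffer" error from fuzzywuzzy. My code is: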
from fuzzywuzzy import process
from pandas import read_csv

if __name__ == '__main__':
    df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")
    df_false = df[df['match_manual'].isnull()]
    df_true = df[df['match_manual'].notnull()]
    sss_false = df_false['sss'].values.tolist()
    sss_true = df_true['sss'].values.tolist()
    for sssf in sss_false:
        mmm = process.extractOne(sssf, sss_true) # find best choice
        print sssf + str(tuple(mmm))
This produces the following error:
Traceback (most recent call last):
  File "fuzzywuzzy_usm2_csv_test.py", line 21, in <module>
    mmm = process.extractOne(sssf, sss_true) # find best choice
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 123, in extractOne
    best_list = extract(query, choices, processor, scorer, limit=1)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 84, in extract
    processed = processor(choice)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/utils.py", line 63, in full_process
    string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/string_processing.py", line 25, in replace_non_letters_non_numbers_with_whitespace
    return cls.regex.sub(u" ", a_string)
TypeError: expected string or buffer
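This has something to do with importing into pandas with a specified encoding, which I added to prevent UnicodeDecodeErrors but which has the knock-on effect of causing this error. I have tried coercing the object with str(sssf), but that does not work.
So, I have isolated the line that causes the error: #N/A,,,,,, (line 29 in the data pasted below). I assumed the # was causing the error, but strangely it is not. It is the A character that causes the problem, because the file works when the A is removed. What seems odd to me is that the string two lines below is N/A and parses just fine, yet line 29 will not parse when I remove the # symbol, even though the field then looks identical to the one below.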
sss,sid,match_manual,notes,match_date,source,match_by
N20 KIDS,1095543_cha,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 FESTIVAL,08190588_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 LTD,,,,,,
N21 LTD.,04615294_com,true,,2014-12-02,,OpenCorps
N2 CHECK,08105000_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 CHECK LIMITED,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LIMITED,08184223_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 3)
N 2 CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 CHECK LTD,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2E & BACK LTD,05218805_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 GROUP LLC,04627044_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 GROUP LTD,04475764_com,true,,2014-05-05,data taken from u_supplier_match,20140429_fuzzy_match.ktr (stream 2)
N2R PRODUCTIONS,SC266951_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 VISUAL COMMUNICATIONS LIMITED,,,,,,
N2 VISUAL COMMUNICATIONS LTD,03144224_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N2WEB,07636689_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N3 DISPLAY GRAPHICS LTD,04008480_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N3O LIMITED,06561158_com,true,,2014-12-02,,OpenCorps
N3O LTD,,,,,,
N400138,,,,,,
N400360,,,,,,
N4K LTD,07054740_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N51 LTD,,,,,,
N68 LTD,,,,,,
N8 LTD,,,,,,
N9 DESIGN,07342091_com,true,,2015-02-07,openrefine/opencorporates,IM
#N/A,,,,,,
N A,,,,,,
N/A,red_general_xtr,true,Matches done manually,2015-04-16,manual matching,IM
(N) A & A BUILDERS LTD,,,,,,
Your sss_true variable contains:
[
u'N21 LTD.',
u'N2 CHECK LIMITED',
u'N2 CHECK LTD',
u'N2 GROUP LTD',
u'N2 VISUAL COMMUNICATIONS LTD',
u'N3 DISPLAY GRAPHICS LTD',
u'N3O LIMITED',
u'N9 DESIGN',
nan # <---- note this
]
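Once you get rid of that not-a-number value, everything starts working as expected.
By default, pandas.read_csv parses the string 'N/A' (and also '#N/A') as not-a-number (NaN). In your case, this means you end up with a nan value instead of a string. In your sample dataset this happens in two places:
the '#N/A' row (the one you highlighted in your question) leaves a nan at sss_false[-3]
the 'N/A' row leaves a nan at sss_true[-1]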
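To see why a single nan produces this particular TypeError, here is a minimal sketch of what the last frame of the traceback does. It assumes Python 3, and the pattern `non_alnum` and the helper `clean` are hypothetical stand-ins, not fuzzywuzzy's actual regex or function.

```python
import re

# Hypothetical stand-in for the regex that fuzzywuzzy applies to each string.
non_alnum = re.compile(r"\W")

# pandas stores a missing value as a float nan, not as a string.
nan = float("nan")


def clean(value):
    """Replace every non-alphanumeric character with a space."""
    return non_alnum.sub(" ", value)


print(repr(clean("N2 CHECK LTD.")))

try:
    clean(nan)  # a float reaches the regex, just like in the traceback
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# str() does not raise, but it only turns nan into the text 'nan'.
print(repr(str(nan)))
```

Note that the traceback fails in `processor(choice)`, that is, while processing an entry of the choices list (sss_true). That would explain why wrapping the query in str(sssf) did not help: the nan inside sss_true is still passed to the regex untouched.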
Option 1
If you want the string 'N/A' to be parsed as a string rather than as nan, replace
df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")
with
df = read_csv("usm_clean.csv", encoding = "ISO-8859-1", keep_default_na=False, na_values='')
The meaning of these extra options is described in the pandas docs:
na_values : list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values
keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to
So the modification above tells pandas to recognize the empty string as NA and to discard the default NA values such as 'N/A'.
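The effect of the two options can be checked on a small in-memory CSV. This is only a sketch, assuming Python 3 and a recent pandas; the three data rows in `csv_text` are made up to mirror the first two columns of the question's file.

```python
import io

import pandas as pd

# Made-up sample in the same shape as the first two columns of the question's CSV.
csv_text = (
    "sss,sid\n"
    "N9 DESIGN,07342091_com\n"
    "#N/A,\n"
    "N/A,red_general_xtr\n"
)

# Default behaviour: '#N/A' and 'N/A' are both read as NaN.
default_df = pd.read_csv(io.StringIO(csv_text))

# Only the empty string counts as missing; 'N/A' and '#N/A' stay strings.
strict_df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)

print(default_df["sss"].tolist())
print(strict_df["sss"].tolist())
print(strict_df["sid"].tolist())
```

With the stricter settings, the empty sid field is still treated as missing, so a filter such as isnull() on a column with blank cells keeps working.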
Option 2
If you would rather discard the rows that have 'N/A' in the first column, you need to remove the nan members from sss_true and sss_false. One way to do this is:
# nan is a float, whereas real company names are strings, so drop the floats
sss_true = [x for x in sss_true if not isinstance(x, float)]
sss_false = [x for x in sss_false if not isinstance(x, float)]
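An alternative to filtering the two lists is to drop the affected rows from the DataFrame before splitting it. The sketch below assumes Python 3 and a recent pandas, and uses a made-up four-row CSV rather than the real usm_clean.csv.

```python
import io

import pandas as pd

# Made-up sample: two usable names and two rows that pandas reads as NaN.
csv_text = (
    "sss,match_manual\n"
    "N9 DESIGN,true\n"
    "#N/A,\n"
    "N A,\n"
    "N/A,true\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# Drop every row whose company name was parsed as NaN.
df = df.dropna(subset=["sss"])

# Split into manually matched and unmatched names, as in the question.
sss_true = df[df["match_manual"].notnull()]["sss"].tolist()
sss_false = df[df["match_manual"].isnull()]["sss"].tolist()

print(sss_true)
print(sss_false)
```

Dropping the rows once, up front, means neither list can contain a nan, so nothing but strings reaches process.extractOne.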