Trying to Perform Fuzzy Matching in Python
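I am trying to use fuzzywuzzy to compare two columns in a DataFrame. I want to know whether the string in one column ('Relationship') appears, even partially, in another column ('CUST_NAME'). I then want to repeat the process for a second column ('Dealer_Name') against the same 'CUST_NAME' column. I am currently trying to run the following code:
Here is my DataFrame: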
RapDF1 = RapDF[['APP_KEY','Relationship','Dealer_Name','CUST_NAME']]
And here is the fuzzy matching:
from fuzzywuzzy import process, fuzz
RapDF1.assign(dealer_compare=[process.extract(i, RapDF1['Dealer_Name'], limit=3) for i in RapDF1['CUST_NAME']])
RapDF1.assign(broker_compare=[process.extract(i, RapDF1['Relationship'], limit=3) for i in RapDF1['CUST_NAME']])
However, I get the following Python error:
TypeError Traceback (most recent call last)
<ipython-input-76-2faf28514c26> in <module>()
52 # Attempt 7
53
---> 54 RapDF1.assign(dealer_compare=[process.extract(i, RapDF1['Dealer_Name'], limit=3) for i in RapDF1['CUST_NAME']])
55 RapDF1.assign(broker_compare=[process.extract(i, RapDF1['Relationship'], limit=3) for i in RapDF1['CUST_NAME']])
56
<ipython-input-76-2faf28514c26> in <listcomp>(.0)
52 # Attempt 7
53
---> 54 RapDF1.assign(dealer_compare=[process.extract(i, RapDF1['Dealer_Name'], limit=3) for i in RapDF1['CUST_NAME']])
55 RapDF1.assign(broker_compare=[process.extract(i, RapDF1['Relationship'], limit=3) for i in RapDF1['CUST_NAME']])
56
C:\ProgramData\Anaconda3\lib\site-packages\fuzzywuzzy\process.py in extract(query, choices, processor, scorer, limit)
166 """
167 sl = extractWithoutOrder(query, choices, processor, scorer)
--> 168 return heapq.nlargest(limit, sl, key=lambda i: i[1]) if limit is not None else \
169 sorted(sl, key=lambda i: i[1], reverse=True)
170
C:\ProgramData\Anaconda3\lib\heapq.py in nlargest(n, iterable, key)
567 # General case, slowest method
568 it = iter(iterable)
--> 569 result = [(key(elem), i, elem) for i, elem in zip(range(0, -n, -1), it)]
570 if not result:
571 return result
C:\ProgramData\Anaconda3\lib\heapq.py in <listcomp>(.0)
567 # General case, slowest method
568 it = iter(iterable)
--> 569 result = [(key(elem), i, elem) for i, elem in zip(range(0, -n, -1), it)]
570 if not result:
571 return result
C:\ProgramData\Anaconda3\lib\site-packages\fuzzywuzzy\process.py in extractWithoutOrder(query, choices, processor, scorer, score_cutoff)
76
77 # Run the processor on the input query.
---> 78 processed_query = processor(query)
79
80 if len(processed_query) == 0:
C:\ProgramData\Anaconda3\lib\site-packages\fuzzywuzzy\utils.py in full_process(s, force_ascii)
93 s = asciidammit(s)
94 # Keep only Letters and Numbers (see Unicode docs).
---> 95 string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
96 # Force into lowercase.
97 string_out = StringProcessor.to_lower_case(string_out)
C:\ProgramData\Anaconda3\lib\site-packages\fuzzywuzzy\string_processing.py in replace_non_letters_non_numbers_with_whitespace(cls, a_string)
24 numbers with a single white space.
25 """
---> 26 return cls.regex.sub(" ", a_string)
27
28 strip = staticmethod(string.strip)
TypeError: expected string or bytes-like object
The DataFrame probably contains NaN values. NaN has type float, which causes the error:
from fuzzywuzzy import process, fuzz
import pandas as pd
import numpy as np
df_nan = pd.DataFrame({'text1': ["quick", "brown", "fox"], "text2": ["hello", np.nan, "world"]})
df_nan
Out:
text1 text2
0 quick hello
1 brown NaN
2 fox world
A minimal example that raises the same error:
[process.extract(i, df_nan['text1'], limit=3) for i in df_nan['text2']]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
...
/usr/local/lib/python3.6/dist-packages/fuzzywuzzy/string_processing.py in replace_non_letters_non_numbers_with_whitespace(cls, a_string)
24 numbers with a single white space.
25 """
---> 26 return cls.regex.sub(" ", a_string)
27
28 strip = staticmethod(string.strip)
TypeError: expected string or bytes-like object
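The underlying cause can be reproduced without fuzzywuzzy or pandas. The sketch below uses only the standard library; the pattern `r"\W"` is a hypothetical stand-in for whatever regex fuzzywuzzy compiles internally, not a copy of it. It only shows that a regex substitution rejects a float such as NaN.

```python
import re

# NaN is a float, not a string.
nan = float("nan")
print(type(nan).__name__)

# Hypothetical stand-in for the regex used inside fuzzywuzzy's string processing.
pattern = re.compile(r"\W")

raised = False
message = ""
try:
    # Regex substitution only accepts str or bytes-like input.
    pattern.sub(" ", nan)
except TypeError as exc:
    raised = True
    message = str(exc)

print(raised, message)
```

Any non-string value in the query column (NaN, None, numbers) would hit the same check.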
Replace NaN with some token (choosing the right token is a hard, data-dependent task, and an empty string is probably a poor choice):
df = df_nan.fillna('##SOME_TOKEN##')
[process.extract(i, df['text1'], limit=3) for i in df['text2']]
Out:
[[('fox', 36, 2), ('brown', 20, 1), ('quick', 0, 0)],
[('brown', 36, 1), ('fox', 30, 2), ('quick', 18, 0)],
[('fox', 30, 2), ('brown', 20, 1), ('quick', 0, 0)]]
I think replacing or removing all non-string values will help.
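As a rough sketch of the "remove non-string values" option, the helper below skips any query that is not a string and drops non-string choices before matching. The names `safe_extract` and `difflib_matcher` are hypothetical, and `difflib.get_close_matches` from the standard library is used as a stand-in matcher so the example runs without fuzzywuzzy. It is not the fuzzywuzzy scorer and will rank matches differently.

```python
import difflib


def safe_extract(query, choices, matcher, limit=3):
    """Run matcher only for string queries, ignoring non-string choices."""
    if not isinstance(query, str):
        # NaN (a float), None, numbers: nothing sensible to match.
        return []
    clean_choices = [c for c in choices if isinstance(c, str)]
    return matcher(query, clean_choices, limit=limit)


def difflib_matcher(query, choices, limit=3):
    # Stand-in matcher (assumption): returns up to `limit` closest strings.
    return difflib.get_close_matches(query, choices, n=limit, cutoff=0.0)


queries = ["hello", float("nan"), "world"]
choices = ["quick", None, "brown", "fox"]

results = [safe_extract(q, choices, difflib_matcher) for q in queries]
print(results)
```

Since `process.extract` accepts `query`, `choices` and a `limit` keyword, it could presumably be passed as `matcher` in place of `difflib_matcher`. Rows with a missing query then get an empty list instead of raising `TypeError`.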