Fuzzy Lookup In Python
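I have two CSV files: one contains vendor data and the other contains employee data. Similar to what "Fuzzy Lookup" does in Excel, I want to perform two types of matches and output all columns from both CSV files, plus a new column holding the similarity ratio for each row. In Excel, I would use a 0.80 threshold. Sample data is below; my actual data has 2 million rows in one of the files, which would be a nightmare to process in Excel.
Output 1: From the vendor file, fuzzy match "Vendor Name" with "Employee Name" from the employee file. Display all columns from both files and a new column for the similarity ratio.
Output 2: From the vendor file, fuzzy match "SSN" with "SSN" from the employee file. Display all columns from both files and a new column for the similarity ratio.
These are two separate outputs.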
Dataframe 1: Vendor Data
Company | Vendor ID | Vendor Name | Invoice Number | Transaction Amt | Vendor Type | SSN |
---|---|---|---|---|---|---|
15 | 58421 | CLIFFORD BROWN | 854 | 500 | Misc | 668419628 |
150 | 9675 | GREEN | 7412 | 70 | One Time | 774801971 |
200 | 15789 | SMITH, JOHN | 80 | 40 | Employee | 965214872 |
200 | 69997 | HAROON, SIMAN | 964 | 100 | Misc | 741-98-7821 |
Dataframe 2: Employee Data
Employee Name | Employee ID | Manager | SSN |
---|---|---|---|
BROWN, CLIFFORD | 1 | Manager 1 | 668-419-628 |
BLUE, CITY | 2 | Manager 2 | 874126487 |
SMITH, JOHN | 3 | Manager 3 | 965-21-4872 |
HAROON, SIMON | 4 | Manager 4 | 741-98-7820 |
Expected Output 1 - Match Names
Employee Name | Employee ID | Manager | SSN | Company | Vendor ID | Vendor Name | Invoice Number | Transaction Amt | Vendor Type | SSN | Similarity Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|
BROWN, CLIFFORD | 1 | Manager 1 | 668-419-628 | 150 | 58421 | CLIFFORD BROWN | 854 | 500 | Misc | 668419628 | 1.00 |
SMITH, JOHN | 3 | Manager 3 | 965-21-4872 | 200 | 15789 | SMITH, JOHN | 80 | 40 | Employee | 965214872 | 1.00 |
HAROON, SIMON | 4 | Manager 4 | 741-98-7820 | 200 | 69997 | HAROON, SIMAN | 964 | 100 | Misc | 741-98-7821 | 0.96 |
BLUE, CITY | 2 | Manager 2 | 874126487 | | | | | | | | 0.00 |
Expected Output 2 - Match SSN
Employee Name | Employee ID | Manager | SSN | Company | Vendor ID | Vendor Name | Invoice Number | Transaction Amt | Vendor Type | SSN | Similarity Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|
BROWN, CLIFFORD | 1 | Manager 1 | 668-419-628 | 150 | 58421 | CLIFFORD, BROWN | 854 | 500 | Misc | 668419628 | 0.97 |
SMITH, JOHN | 3 | Manager 3 | 965-21-4872 | 200 | 15789 | SMITH, JOHN | 80 | 40 | Employee | 965214872 | 0.97 |
BLUE, CITY | 2 | Manager 2 | 874126487 | | | | | | | | 0.00 |
HAROON, SIMON | 4 | Manager 4 | 741-98-7820 | | | | | | | | 0.00 |
I tried the code below:

    import pandas as pd
    from fuzzywuzzy import fuzz

    df1 = pd.read_excel(r'Directory\Sample Vendor Data.xlsx')
    df2 = pd.read_excel(r'Directory\Sample Employee Data.xlsx')

    matched_names = []
    # compare every vendor name with every employee name
    for row1 in df1.index:
        name1 = df1.at[row1, 'Vendor Name']
        for row2 in df2.index:
            name2 = df2.at[row2, 'Employee Name']
            match = fuzz.ratio(name1, name2)
            if match > 80:  # This is the threshold
                matched_names.append([name1, name2, match])

    df_ratio = pd.DataFrame(columns=['Vendor Name', 'Employee Name', 'match'], data=matched_names)
    df_ratio.to_csv(r'directory\MatchingResults.csv', encoding='utf-8')

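I'm just not getting the results I want, and I'm ready to rewrite the whole script. Any suggestions for improving it would help. Please note that I'm new to Python, so please be gentle. I'm completely open to a new approach for this example.
Update, September 23: Still having trouble... I can now get the similarity ratio, but I can't get all the columns from both CSV files. The problem is that the two files are completely different, so when I concatenate them I get NaN values. Any suggestions? The new code is below: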

    import numpy as np
    from fuzzywuzzy import fuzz
    from itertools import product
    import pandas as pd

    df1 = pd.read_excel(r'Directory\Sample Vendor Data.xlsx')
    df2 = pd.read_excel(r'Directory\Sample Workday Data.xlsx')
    df1['full_name'] = df1['Vendor Name']
    df2['full_name'] = df2['Employee Name']
    df1_name = df1['full_name']
    df2_name = df2['full_name']

    # stack both files vertically (this is where the NaN values come from)
    frames = [pd.DataFrame(df1), pd.DataFrame(df2)]
    df = pd.concat(frames).reset_index(drop=True)

    # score every pair of names and reshape the scores into a square matrix
    dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
    dfresult = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

    # create a list of dataframes
    listOfDfs = [dfresult.loc[idx] for idx in np.split(dfresult.index, df.shape[0])]
    DataFrameDict = {df['full_name'][i]: listOfDfs[i] for i in range(dfresult.shape[0])}
    for name in DataFrameDict.keys():
        print(name)
        # print(DataFrameDict[name])

    pd.DataFrame(list(DataFrameDict.items())).to_excel(r'Directory\TestOutput.xlsx', index=False)

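To concatenate the two DataFrames horizontally, I aligned the Employees DataFrame by the index of the matched Vendor Name. If no Vendor Name was matched, I just put an empty row instead.
In more detail:
- I iterated over the vendor names and, for each one, added the index of the employee name with the highest score to a list of indices. Note that at most one matched employee record is added per vendor name.
- If no match was found (the score was too low), I added the index of an empty record that I had appended manually to the Employees DataFrame.
- This list of indices is then used to reorder the Employees DataFrame.
- Finally, I merged the two DataFrames horizontally. The two DataFrames do not need to be the same size at this point; in that case, `concat` fills the gap by appending missing rows to the smaller DataFrame.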
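As a library-free illustration of the steps above, here is a minimal sketch on plain Python lists. It is not the pandas implementation: `difflib.SequenceMatcher` from the standard library stands in for thefuzz, and the helper names `token_sort_similarity` and `align` are hypothetical. Tokens are sorted before comparison so that "CLIFFORD BROWN" and "BROWN, CLIFFORD" compare as equal, which is an assumption about the desired behaviour.

```python
import re
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] after lowercasing, dropping punctuation and sorting tokens."""
    def normalize(s: str) -> str:
        return " ".join(sorted(re.findall(r"[a-z0-9]+", s.lower())))
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def align(vendors, employees, cutoff=0.80):
    """Pair each vendor with at most one employee; unmatched employees go last."""
    rows, matched = [], set()
    for vendor in vendors:
        # Position and score of the best-scoring employee for this vendor.
        best_idx, best_score = max(
            ((i, token_sort_similarity(vendor, e)) for i, e in enumerate(employees)),
            key=lambda pair: pair[1],
        )
        if best_score >= cutoff:
            matched.add(best_idx)
            rows.append((vendor, employees[best_idx], round(best_score, 2)))
        else:
            # No employee is similar enough: None plays the role of the empty row.
            rows.append((vendor, None, 0.0))
    # Employees that no vendor matched are appended at the end.
    rows.extend((None, e, 0.0) for i, e in enumerate(employees) if i not in matched)
    return rows


vendors = ["CLIFFORD BROWN", "GREEN", "SMITH, JOHN", "HAROON, SIMAN"]
employees = ["BROWN, CLIFFORD", "BLUE, CITY", "SMITH, JOHN", "HAROON, SIMON"]
rows = align(vendors, employees)
for row in rows:
    print(row)
```

Each tuple corresponds to one row of the merged output: vendor, matched employee (or `None`), and score. "GREEN" gets no partner, and "BLUE, CITY" ends up in the trailing unmatched row.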
The pandas code is as follows:

    import pandas as pd
    from thefuzz import process as fuzzy_process  # thefuzz is the renamed fuzzywuzzy package

    # import dataframes
    ...

    # adding empty row (DataFrame.append was removed in pandas 2.0, so reindex is used)
    employees_df = employees_df.reset_index(drop=True).reindex(range(len(employees_df) + 1))
    index_of_empty = len(employees_df) - 1

    # matching between vendor and employee names
    indexed_employee_names_dict = dict(enumerate(employees_df["Employee Name"]))
    matched_employees = set()
    ordered_employees = []
    scores = []
    for vendor_name in vendors_df["Vendor Name"]:
        # with a dict of choices, extractOne returns (value, score, key) or None
        match = fuzzy_process.extractOne(
            query=vendor_name,
            choices=indexed_employee_names_dict,
            score_cutoff=80
        )
        score, index = match[1:] if match is not None else (0.0, index_of_empty)
        matched_employees.add(index)
        ordered_employees.append(index)
        scores.append(score)

    # detect unmatched employees to be positioned at the end of the dataframe
    missing_employees = [i for i in range(len(employees_df)) if i not in matched_employees]
    ordered_employees.extend(missing_employees)

    ordered_employees_df = employees_df.iloc[ordered_employees].reset_index()
    merged_df = pd.concat([vendors_df, ordered_employees_df], axis=1)

    # adding the scores column and sorting by its values
    scores.extend([0] * len(missing_employees))
    merged_df["Similarity Ratio"] = pd.Series(scores) / 100
    merged_df = merged_df.sort_values("Similarity Ratio", ascending=False)

Matching on the SSN column can be done in exactly the same way: just replace the column names in the code above. Moreover, the process can be generalized into a function that accepts the DataFrames and column names:

    import pandas as pd
    from thefuzz import process as fuzzy_process


    def match_and_merge(df1: pd.DataFrame, df2: pd.DataFrame, col1: str, col2: str, cutoff: int = 80):
        # adding empty row (DataFrame.append was removed in pandas 2.0, so reindex is used)
        df2 = df2.reset_index(drop=True).reindex(range(len(df2) + 1))
        index_of_empty = len(df2) - 1

        # matching between the values of col1 and col2
        indexed_strings_dict = dict(enumerate(df2[col2]))
        matched_indices = set()
        ordered_indices = []
        scores = []
        for s1 in df1[col1]:
            match = fuzzy_process.extractOne(
                query=s1,
                choices=indexed_strings_dict,
                score_cutoff=cutoff
            )
            score, index = match[1:] if match is not None else (0.0, index_of_empty)
            matched_indices.add(index)
            ordered_indices.append(index)
            scores.append(score)

        # detect unmatched rows of df2 to be positioned at the end of the dataframe
        missing_indices = [i for i in range(len(df2)) if i not in matched_indices]
        ordered_indices.extend(missing_indices)
        ordered_df2 = df2.iloc[ordered_indices].reset_index()

        # merge rows of dataframes
        merged_df = pd.concat([df1, ordered_df2], axis=1)

        # adding the scores column and sorting by its values
        scores.extend([0] * len(missing_indices))
        merged_df["Similarity Ratio"] = pd.Series(scores) / 100
        return merged_df.sort_values("Similarity Ratio", ascending=False)


    if __name__ == "__main__":
        vendors_df = pd.read_excel(r'Directory\Sample Vendor Data.xlsx')
        employees_df = pd.read_excel(r'Directory\Sample Workday Data.xlsx')

        merged_df = match_and_merge(vendors_df, employees_df, "Vendor Name", "Employee Name")
        merged_df.to_excel("merged_by_names.xlsx", index=False)

        merged_df = match_and_merge(vendors_df, employees_df, "SSN", "SSN")
        merged_df.to_excel("merged_by_ssn.xlsx", index=False)

The code above produces the following output:
merged_by_names.xlsx
Company | Vendor ID | Vendor Name | Invoice Number | Transaction Amt | Vendor Type | SSN | index | Employee Name | Employee ID | Manager | SSN | Similarity Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
200 | 15789 | SMITH, JOHN | 80 | 40 | Employee | 965214872 | 2 | SMITH, JOHN | 3 | Manager 3 | 965-21-4872 | 1 |
15 | 58421 | CLIFFORD BROWN | 854 | 500 | Misc | 668419628 | 0 | BROWN, CLIFFORD | 1 | Manager 1 | 668-419-628 | 0.95 |
200 | 69997 | HAROON, SIMAN | 964 | 100 | Misc | 741-98-7821 | 3 | HAROON, SIMON | 4 | Manager 4 | 741-98-7820 | 0.92 |
150 | 9675 | GREEN | 7412 | 70 | One Time | 774801971 | 4 | nan | nan | nan | nan | 0 |
nan | nan | nan | nan | nan | nan | nan | 1 | BLUE, CITY | 2 | Manager 2 | 874126487 | 0 |
merged_by_ssn.xlsx
Company | Vendor ID | Vendor Name | Invoice Number | Transaction Amt | Vendor Type | SSN | index | Employee Name | Employee ID | Manager | SSN | Similarity Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
200 | 69997 | HAROON, SIMAN | 964 | 100 | Misc | 741-98-7821 | 3 | HAROON, SIMON | 4 | Manager 4 | 741-98-7820 | 0.91 |
15 | 58421 | CLIFFORD BROWN | 854 | 500 | Misc | 668419628 | 0 | BROWN, CLIFFORD | 1 | Manager 1 | 668-419-628 | 0.9 |
200 | 15789 | SMITH, JOHN | 80 | 40 | Employee | 965214872 | 2 | SMITH, JOHN | 3 | Manager 3 | 965-21-4872 | 0.9 |
150 | 9675 | GREEN | 7412 | 70 | One Time | 774801971 | 4 | nan | nan | nan | nan | 0 |
nan | nan | nan | nan | nan | nan | nan | 1 | BLUE, CITY | 2 | Manager 2 | 874126487 | 0 |
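The SSN scores of about 0.9 for otherwise identical numbers are consistent with the hyphens being counted as differences. As a rough check, the sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in for the thefuzz scorer (an assumption; the two can differ on other inputs) and a hypothetical helper `digits_only` that strips everything except digits before comparing. Whether stripping the formatting is acceptable depends on the data.

```python
import re
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]: 2 * matched characters / total characters."""
    return SequenceMatcher(None, a, b).ratio()


def digits_only(ssn) -> str:
    """Hypothetical normaliser: keep digits only, so 965-21-4872 -> 965214872."""
    return re.sub(r"\D", "", str(ssn))


vendor_ssn, employee_ssn = "965214872", "965-21-4872"

raw = similarity(vendor_ssn, employee_ssn)
normalized = similarity(digits_only(vendor_ssn), digits_only(employee_ssn))

print(round(raw, 2))         # 0.9: the two hyphens count against the score
print(round(normalized, 2))  # 1.0: identical once the formatting is removed
```

With this kind of normalisation, a genuinely different pair such as 741-98-7821 and 741-98-7820 would still score below 1.0, because one digit differs.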