Fuzzy duplicates with pandas
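I have one DataFrame containing two columns of string data. I need to compare the columns 'NameTest' and 'Name': every name in 'NameTest' should be compared with all names in 'Name', and if the closest one matches by more than 80%, that closest matching name should be printed.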
My DataFrame:

| | NameTest | Name |
|---|---|---|
| 0 | john carry | john carrt |
| 1 | alex midlane | john crat |
| 2 | robert patt | alex mid |
| 3 | david baker | alex |
| 4 | NaN | patt |
| 5 | NaN | robert |
| 6 | NaN | david baker |
My code:

    from fuzzywuzzy import fuzz, process
    import pandas as pd
    import numpy as np
    import difflib

    cols = ["Name", "NameTest"]
    df = pd.read_excel(
        r'D:\FFOutput\name.xlsx', usecols=cols,)  # Read Excel
    for i, row in df.iterrows():
        na = row.Name
        ne = row.NameTest
        print([ne, na])
        for i in na:
            c = difflib.SequenceMatcher(isjunk=None, a=ne, b=na)
            diff = c.ratio()*100
            diff = round(diff, 1)
            if diff >= 80:
                print(na, diff)
Any suggestions?
Thanks for your help.
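For this, FuzzyWuzzy provides `process.extractOne`, which searches for the best match above a score threshold. Searching for `len(df)` names among `len(df)` candidates requires `len(df) * len(df)` comparisons (assuming no elements are `np.nan`), which gets very time-consuming for bigger tables. This is why I use RapidFuzz (I am the author) in my answer, which is a lot faster. However, if performance is not relevant for the task, you can simply replace the import statement with the fuzzywuzzy one.

You can rewrite your code as follows: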

    import numpy as np
    import pandas as pd
    from rapidfuzz import process, fuzz

    df = pd.DataFrame({
        "NameTest": ["john carry", "alex midlane", "robert patt", "david baker", np.nan, np.nan, np.nan],
        "Name": ["john carrt", "john crat", "alex mid", "alex", "patt", "robert", "david baker"]
    })

    # filter out non-strings, since they are not supported by rapidfuzz/fuzzywuzzy/difflib
    Names = [name for name in df["Name"] if isinstance(name, str)]

    for NameTest in df["NameTest"]:
        if isinstance(NameTest, str):
            match = process.extractOne(
                NameTest, Names,
                scorer=fuzz.ratio,
                processor=None,
                score_cutoff=80)
            if match:
                print(match[0], match[1])

Prints:

    john carrt 90.0
    alex mid 80.0
    david baker 100.0
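
If installing a third-party package is not an option, the same "best match above a cutoff" idea can be sketched with only the standard library's `difflib`, which the question's code already imports. This is a minimal sketch under assumptions, not a drop-in replacement: the helper name `best_matches` is hypothetical, the inputs are plain lists (a pandas column can be passed the same way), and `difflib` computes its ratio differently from `fuzz.ratio`, so scores on other data may not agree with RapidFuzz.

```python
import difflib


def best_matches(queries, choices, cutoff=0.8):
    """Hypothetical helper: return (query, best_choice, score_percent) for
    every query whose closest choice reaches the cutoff (0.0 to 1.0)."""
    # difflib only compares sequences such as str, so drop NaN and other non-strings
    names = [c for c in choices if isinstance(c, str)]
    results = []
    for query in queries:
        if not isinstance(query, str):
            continue
        # n=1 keeps only the closest candidate that scores at least `cutoff`
        found = difflib.get_close_matches(query, names, n=1, cutoff=cutoff)
        if found:
            # same argument order that get_close_matches uses internally
            score = difflib.SequenceMatcher(None, found[0], query).ratio() * 100
            results.append((query, found[0], round(score, 1)))
    return results


name_test = ["john carry", "alex midlane", "robert patt", "david baker",
             float("nan"), float("nan"), float("nan")]
name = ["john carrt", "john crat", "alex mid", "alex", "patt", "robert", "david baker"]

for query, match, score in best_matches(name_test, name):
    print(match, score)
```

Like the loop above, this still compares every query against every candidate, so it is only reasonable for small tables; for larger ones the performance argument for RapidFuzz applies.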