How to set a column value by fuzzy string matching with another dataframe?
I have already looked at related answers, but could not adapt them to my specific case. I have two dataframes:
import pandas as pd
df1 = pd.DataFrame(
    {
        "ein": {0: 1001, 1: 1500, 2: 3000},
        "ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
        "lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
        "fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
    }
)
df2 = pd.DataFrame(
    {
        "lname": {0: "Couper", 1: "Cruise", 2: "Pit"},
        "fname": {0: "Brad", 1: "Tom", 2: "Brad"},
        "score": {0: 3, 1: 3.5, 2: 4},
    }
)
Then I do:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from itertools import product

N = 60
names = {
    tup: fuzz.ratio(*tup)
    for tup in product(df1["lname"].tolist(), df2["lname"].tolist())
}
s1 = pd.Series(names)
s1 = s1[s1 > N]
s1 = s1[s1.groupby(level=0).idxmax()]

degrees = {
    tup: fuzz.ratio(*tup)
    for tup in product(df1["fname"].tolist(), df2["fname"].tolist())
}
s2 = pd.Series(degrees)
s2 = s2[s2 > N]
s2 = s2[s2.groupby(level=0).idxmax()]

df2["lname"] = df2["lname"].map(s1).fillna(df2["lname"])
df2["fname"] = df2["fname"].map(s2).fillna(df2["fname"])
df = df1.merge(df2, on=["lname", "fname"], how="outer")
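The result is not what I expect. Can you help me fix this code? Note that I have millions of rows in df1 and millions of rows in df2, so I also need it to be reasonably efficient.
Basically, I need to match the people in df1 with the people in df2. In the example above, I match them on last name (lname) and first name (fname). I also have a third field, which I left out here for brevity.
The expected result should look like this: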
    ein             ein_name   lname    fname  score
0  1001       H for Humanity  Cooper  Bradley      3
1  1500          Labor Union  Cruise   Thomas    3.5
2  3000  Something something    Pitt     Brad      4
You can try this:
from functools import cache

import pandas as pd
from fuzzywuzzy import fuzz

# First, define indices and values to check for matches
indices_and_values = [(i, value) for i, value in enumerate(df2["lname"] + df2["fname"])]

# Define helper functions to find matching rows and get corresponding score
@cache
def find_match(x):
    return [i for i, value in indices_and_values if fuzz.ratio(x, value) > 75]

def get_score(x):
    try:
        return df2.loc[x[0], "score"]
    except (KeyError, IndexError):
        return pd.NA

# Add scores to df1:
df1["score"] = (
    (df1["lname"] + df1["fname"])
    .apply(find_match)
    .apply(get_score)
)
Then:
print(df1)
    ein             ein_name   lname    fname  score
0  1001       H for Humanity  Cooper  Bradley    3.0
1  1500          Labor Union  Cruise   Thomas    3.5
2  3000  Something something    Pitt     Brad    4.0
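One thing worth noting about the code above: find_match returns every df2 row that scores above 75, and get_score takes the first of them, which is not necessarily the closest one. Here is a minimal sketch of picking the highest-scoring candidate instead. It is not part of the original answer: similarity and best_match are hypothetical helpers, and difflib.SequenceMatcher from the standard library stands in for fuzz.ratio so the sketch runs without extra packages (exact scores may differ from fuzzywuzzy's).

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Stand-in for fuzz.ratio (assumption): similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(name, candidates, threshold=75):
    """Return the index of the highest-scoring candidate above threshold, or None."""
    best_index, best_score = None, threshold
    for i, candidate in enumerate(candidates):
        score = similarity(name, candidate)
        if score > best_score:  # strictly better than anything seen so far
            best_index, best_score = i, score
    return best_index


# Concatenated "lname + fname" keys, as in the answer above
candidates = ["CouperBrad", "CruiseTom", "PitBrad"]
print(best_match("PittBrad", candidates))

# Two candidates clear the threshold here; the exact one is second in the list
print(best_match("PittBrad", ["PitBrad", "PittBrad"]))
```

With millions of rows this still compares every name against every candidate, so it only illustrates the selection rule, not a scalable search.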
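Given the size of your dataframes, I suppose you have duplicate names (same first and last name), hence the use of the @cache decorator from Python's standard library to try to speed things up (but you can do without it).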
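To illustrate what the decorator buys you, here is a small sketch that counts how often the search actually runs when names repeat. The candidates tuple, the calls counter and the sample names are made up for the illustration, and difflib.SequenceMatcher again stands in for fuzz.ratio as an assumption. Requires Python 3.9+ for functools.cache.

```python
from difflib import SequenceMatcher
from functools import cache

calls = 0  # how many times the expensive search really executes

# Hypothetical "lname + fname" keys taken from df2 in the example
candidates = ("CouperBrad", "CruiseTom", "PitBrad")


@cache
def find_match(name: str) -> tuple:
    """Indices of candidates scoring above 75; cached per distinct name."""
    global calls
    calls += 1
    return tuple(
        i
        for i, candidate in enumerate(candidates)
        if round(100 * SequenceMatcher(None, name, candidate).ratio()) > 75
    )


names = ["CruiseTom", "PittBrad", "CruiseTom", "CruiseTom", "PittBrad"]
results = [find_match(n) for n in names]

print(calls)                    # searches run only for distinct names
print(find_match.cache_info())  # hit/miss statistics kept by the cache
```

The sketch returns a tuple rather than a list because a cached list would be the same object on every call, so mutating it in one place would affect all later lookups. The cache only helps when the same full name occurs more than once; it does not reduce the cost of the first lookup for each distinct name.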