Fuzzy Matching with different fuzz ratios
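I have two large datasets: df1 has about 1 million rows and df2 has about 10 million rows. I need to find matches in df2 for the rows in df1.
I posted the original version of this question separately. @laurent gave a good answer there, but I now have some additional requirements. I would now like to:
1. Get the fuzz ratio for each fname and lname as a column in my final matched dataframe.
2. Write the code so that the fuzz ratio threshold for fname is >60 and the threshold for lname is >75. In other words, a true match occurs only if the fuzz ratio for fname is >60 and the fuzz ratio for lname is >75; otherwise it is not a true match. For example, if the fuzz ratio for fname is 80 and the fuzz ratio for lname is 60, it is not a match. I know this could be done as post-hoc filtering on the result of (1), but it makes sense to do it at the coding stage, when the matching itself is performed.
Here is a sample of my data. @laurent's solution to the original question can be found in the original post.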
import pandas as pd

df1 = pd.DataFrame(
    {
        "ein": {0: 1001, 1: 1500, 2: 3000},
        "ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
        "lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
        "fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
    }
)
df2 = pd.DataFrame(
    {
        "lname": {0: "Cupper", 1: "Cruise", 2: "Cruz", 3: "Couper"},
        "fname": {0: "Bradley", 1: "Tom", 2: "Thomas", 3: "M Brad"},
        "score": {0: 3, 1: 3.5, 2: 4, 3: 2.5},
    }
)
The expected output is:
df3 = pd.DataFrame(
    {
        "df1_ein": {0: 1001, 1: 1500, 2: 3000},
        "df1_ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
        "df1_lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
        "df1_fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
        "fuzz_ratio_lname": {0: 83, 1: 100, 2: pd.NA},
        "fuzz_ratio_fname": {0: 62, 1: 67, 2: pd.NA},
        "df2_lname": {0: "Couper", 1: "Cruise", 2: pd.NA},
        "df2_fname": {0: "M Brad", 1: "Tom", 2: pd.NA},
        "df2_score": {0: 2.5, 1: 3.5, 2: pd.NA},
    }
)
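Note the following about the expected output above: given the fuzz ratios I assigned, Bradley Cupper is not a match for Bradley Cooper. The better match for Bradley Cooper is M Brad Couper. Likewise, Thomas Cruise matches Tom Cruise rather than Thomas Cruz.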
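The scores behind that note can be checked with a few lines of code. This is only a sketch: it assumes that `fuzz.ratio` behaves like `difflib.SequenceMatcher` scaled to 0-100 and rounded, which to my understanding is what fuzzywuzzy falls back to when python-Levenshtein is not installed. The helper names `ratio` and `is_match` are made up for illustration.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Similarity on a 0-100 scale (assumed stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def is_match(fname1, lname1, fname2, lname2, fname_min=60, lname_min=75):
    """True only if BOTH ratios are strictly above their thresholds."""
    return ratio(fname1, fname2) > fname_min and ratio(lname1, lname2) > lname_min


# (df1 fname, df1 lname, df2 fname, df2 lname) pairs from the sample data
candidates = [
    ("Bradley", "Cooper", "Bradley", "Cupper"),
    ("Bradley", "Cooper", "M Brad", "Couper"),
    ("Thomas", "Cruise", "Tom", "Cruise"),
    ("Thomas", "Cruise", "Thomas", "Cruz"),
]

results = []
for f1, l1, f2, l2 in candidates:
    row = (ratio(f1, f2), ratio(l1, l2), is_match(f1, l1, f2, l2))
    results.append(row)
    print(f"{f1} {l1} vs {f2} {l2}: fname={row[0]}, lname={row[1]}, match={row[2]}")
```

Under that assumption, Bradley Cupper fails on the last name (67) even though the first name is identical, and Thomas Cruz fails on the last name (60), while M Brad Couper (62, 83) and Tom Cruise (67, 100) pass both thresholds.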
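I am mostly a Stata user (haha), and the reclink2 ado file can in theory do the above, provided Stata can handle the size of the data. With the amount of data I have, however, it had not even started after several hours.
Here is one way to do it: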
import pandas as pd
from fuzzywuzzy import fuzz

# Setup
df1.columns = [f"df1_{col}" for col in df1.columns]

# Add new columns
df1["fuzz_ratio_lname"] = (
    df1["df1_lname"]
    .apply(
        lambda x: max(
            [(value, fuzz.ratio(x, value)) for value in df2["lname"]],
            key=lambda x: x[1],
        )
    )
    .apply(lambda x: x if x[1] > 75 else pd.NA)
)
df1[["df2_lname", "fuzz_ratio_lname"]] = pd.DataFrame(
    df1["fuzz_ratio_lname"].tolist(), index=df1.index
)
df1 = (
    pd.merge(left=df1, right=df2, how="left", left_on="df2_lname", right_on="lname")
    .drop(columns="lname")
    .rename(columns={"fname": "df2_fname"})
)
df1["df2_fname"] = df1["df2_fname"].fillna(value="")
for i, (x, value) in enumerate(zip(df1["df1_fname"], df1["df2_fname"])):
    ratio = fuzz.ratio(x, value)
    df1.loc[i, "fuzz_ratio_fname"] = ratio if ratio > 60 else pd.NA

# Cleanup
df1["df2_fname"] = df1["df2_fname"].replace("", pd.NA)
df1 = df1[
    [
        "df1_ein",
        "df1_ein_name",
        "df1_lname",
        "df1_fname",
        "fuzz_ratio_lname",
        "fuzz_ratio_fname",
        "df2_lname",
        "df2_fname",
        "score",
    ]
]
print(df1)
# Output
   df1_ein         df1_ein_name df1_lname df1_fname  fuzz_ratio_lname  \
0     1001       H for Humanity    Cooper   Bradley              83.0
1     1500          Labor Union    Cruise    Thomas             100.0
2     3000  Something something      Pitt      Brad               NaN

  fuzz_ratio_fname df2_lname df2_fname  score
0             62.0    Couper    M Brad    2.5
1             67.0    Cruise       Tom    3.5
2             <NA>      <NA>      <NA>    NaN
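Note that the approach above picks the single best last-name candidate first and only then scores the first name, so a candidate with a slightly lower last-name score is never considered even if it would pass both thresholds. One possible way to apply both thresholds while matching is sketched below. It is not part of the original answer: it works on plain dictionaries, uses `difflib.SequenceMatcher` as an assumed stand-in for `fuzz.ratio`, and ranks surviving candidates by the sum of the two ratios, which is an arbitrary choice. The names `ratio` and `best_match` are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Similarity on a 0-100 scale (assumed stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def best_match(row, candidates, fname_min=60, lname_min=75):
    """Return (candidate, lname_ratio, fname_ratio) for the best candidate
    that passes BOTH thresholds, or None if no candidate does."""
    best = None
    for cand in candidates:
        lname_ratio = ratio(row["lname"], cand["lname"])
        if lname_ratio <= lname_min:
            continue  # last name too different, skip the first-name check
        fname_ratio = ratio(row["fname"], cand["fname"])
        if fname_ratio <= fname_min:
            continue
        # Assumption: prefer the candidate with the highest combined score.
        if best is None or lname_ratio + fname_ratio > best[1] + best[2]:
            best = (cand, lname_ratio, fname_ratio)
    return best


rows1 = [
    {"fname": "Bradley", "lname": "Cooper"},
    {"fname": "Thomas", "lname": "Cruise"},
    {"fname": "Brad", "lname": "Pitt"},
]
rows2 = [
    {"lname": "Cupper", "fname": "Bradley", "score": 3},
    {"lname": "Cruise", "fname": "Tom", "score": 3.5},
    {"lname": "Cruz", "fname": "Thomas", "score": 4},
    {"lname": "Couper", "fname": "M Brad", "score": 2.5},
]

matches = [best_match(row, rows2) for row in rows1]
for row, match in zip(rows1, matches):
    print(row["fname"], row["lname"], "->", match)
```

With dataframes, the rows could be obtained with `df1.to_dict("records")` and `df2.to_dict("records")`. Comparing every row of a 1-million-row frame against every row of a 10-million-row frame means on the order of 10^13 ratio computations, so at that scale the candidate list passed to `best_match` would have to be narrowed first; that step is not shown here.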