Fuzzy merging in R
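How can I join two objects when the strings that identify them are written differently?
1. Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season
2. 195/75 R16C lid CORDIANT Business CA
These are the same product, because both match the article number 195/75 R16С.
Another example:
1. 185/75 R16C lid Forward Professional 156 ASHK tubeless
2. The tire `185/75 R16С` С-156
Here the shared article number is 185/75 R16C.
A follow-up question on this topic: "R: Error in compare.linkage : Data sets have different format"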
Here is a solution using the RecordLinkage package. I think it does what you need.
Sample data:
library(tidyverse)
library(RecordLinkage)

df_01 <- tibble(
  product = c("Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season",
              "Something else")
)
df_02 <- tibble(
  product = c("195/75 R16C lid CORDIANT Business CA",
              "Different Product")
)
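Since the question identifies products by a size code such as 195/75 R16C, one optional step is to normalize the strings and pull that code out, which allows an exact join where a code is present. The following is a minimal base R sketch, not part of RecordLinkage. The helpers `normalize_product` and `size_code` are hypothetical names, the regular expression assumes codes shaped like "195/75 R16C", and the Cyrillic-to-Latin replacement assumes that the second example in the question contains a Cyrillic "С" where a Latin "C" was meant.

```r
# Hypothetical helper: lower-case, map the Cyrillic Es look-alike to Latin C,
# and tidy the spacing around slashes.
normalize_product <- function(x) {
  x <- gsub("\u0421", "C", x, fixed = TRUE)  # Cyrillic capital Es -> Latin C
  x <- gsub("\u0441", "c", x, fixed = TRUE)  # Cyrillic small es -> Latin c
  x <- tolower(x)
  x <- gsub("\\s*/\\s*", "/", x)             # "195 / 75" -> "195/75"
  x <- gsub("\\s+", " ", x)                  # collapse repeated whitespace
  trimws(x)
}

# Hypothetical helper: extract the first size code, or NA if none is found.
# Assumed shape: three digits, "/", two digits, optional space, "r", two digits, optional "c".
size_code <- function(x) {
  x <- normalize_product(x)
  m <- regexpr("[0-9]{3}/[0-9]{2} ?r[0-9]{2}c?", x)
  code <- rep(NA_character_, length(x))
  code[m > 0] <- regmatches(x, m)
  gsub(" ", "", code, fixed = TRUE)
}

a <- data.frame(
  product = c("Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season",
              "Something else"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  product = c("195/75 R16C lid CORDIANT Business CA",
              "Different Product"),
  stringsAsFactors = FALSE
)
a$code <- size_code(a$product)
b$code <- size_code(b$product)

# Exact join on the extracted code; rows without a code are dropped first,
# because merge() would otherwise match NA keys to each other.
exact <- merge(a[!is.na(a$code), ], b[!is.na(b$code), ],
               by = "code", suffixes = c("_01", "_02"))
exact
```

Rows for which no code can be extracted still need a fuzzy comparison, so this complements the RecordLinkage approach and does not replace it.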
The details of the next part are best left to the RecordLinkage documentation:
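# Build every df_01 x df_02 record pair and compare "product" with Jaro-Winkler
rpairs_jar <- compare.linkage(df_01, df_02,
                              strcmp = c("product"),
                              strcmpfun = jarowinkler)
# Compute a weight for each pair (EpiLink approach)
rpairs_epiwt <- epiWeights(rpairs_jar)
# List all pairs with their weights, without filtering by weight
getPairs(rpairs_epiwt, max.weight = Inf, min.weight = -Inf)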
   id product                                                      Weight
1  1  Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season
2  1  195/75 R16C lid CORDIANT Business CA                         0.6135377
3
4  2  Something else
5  2  Different Product                                            0.4827264
6
7  1  Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season
8  2  Different Product                                            0.4586156
9
10 2  Something else
11 1  195/75 R16C lid CORDIANT Business CA                         0.4320106
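This yields a weight for every pair of rows, indicating how likely the pair is to be a match. As you can see, the rows you want to match get the highest weight.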
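The weights still have to be turned into an actual merge. The following is a minimal sketch of one way to do that. It assumes the `pairs` component (row indices `id1`, `id2`) and the `Wdata` component (weights in the same order) of the object returned by `epiWeights`. The cutoff `min_weight = 0.6` is an arbitrary assumption that happens to suit this sample data, and the names `scored`, `best` and `merged` are illustrative.

```r
library(tibble)
library(RecordLinkage)

df_01 <- tibble(
  product = c("Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season",
              "Something else")
)
df_02 <- tibble(
  product = c("195/75 R16C lid CORDIANT Business CA",
              "Different Product")
)

rpairs <- epiWeights(
  compare.linkage(df_01, df_02, strcmp = c("product"), strcmpfun = jarowinkler)
)

# One row per candidate pair: row index in df_01, row index in df_02, weight.
scored <- data.frame(
  id1 = rpairs$pairs$id1,
  id2 = rpairs$pairs$id2,
  weight = rpairs$Wdata
)

# Keep only the highest-weight candidate for each row of df_01.
scored <- scored[order(-scored$weight), ]
best <- scored[!duplicated(scored$id1), ]

# Assumed cutoff: discard best candidates that are still weak matches.
min_weight <- 0.6
best <- best[best$weight >= min_weight, ]

# Assemble the merged table from the surviving index pairs.
merged <- data.frame(
  product_01 = df_01$product[best$id1],
  product_02 = df_02$product[best$id2],
  weight = best$weight,
  stringsAsFactors = FALSE
)
merged
```

A suitable cutoff depends on the data. Inspecting the pairs just above and below a candidate threshold is a reasonable way to choose one.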