Fuzzy merging in R

Question

How can I join two objects when the records that belong together are written differently?

1.Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season
2.195/75 R16C lid CORDIANT Business CA

But these are the same product, because both match the same article: 195/75 R16С

Another example:

1.185/75 R16C lid Forward Professional 156 ASHK tubeless
2.The tire `185/75 R16С` С-156

185/75 R16C
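
Since both examples come down to a shared size code, one option is to pull that code out and join on it exactly. The following is only a minimal sketch in base R: the helper `extract_size_key`, the data frames `df_a` and `df_b`, and the regular expression are assumptions made for illustration, and the pattern only covers codes shaped like the two shown here. It also maps the Cyrillic "С" to the Latin "C", because the two look identical but do not compare equal.

```r
# Hypothetical helper (not from the original post): extract a normalized
# tire-size key such as "195/75R16C" from a free-text product name.
extract_size_key <- function(x) {
  # Map Cyrillic "С"/"с" (U+0421/U+0441) to Latin "C" so look-alikes compare equal
  x <- chartr("\u0421\u0441", "CC", x)
  x <- toupper(x)
  # width / profile, then "R", rim diameter, and an optional trailing "C"
  pattern <- paste0("[0-9]{3}[[:space:]]*/[[:space:]]*[0-9]{2}",
                    "[[:space:]]*R[[:space:]]*[0-9]{2}C?")
  m <- regexpr(pattern, x)
  hit <- as.vector(m) > 0
  key <- rep(NA_character_, length(x))   # NA where no size code is found
  key[hit] <- regmatches(x, m)
  gsub("[[:space:]]+", "", key)          # drop spaces inside the matched code
}

df_a <- data.frame(
  product = c("Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season",
              "185/75 R16C lid Forward Professional 156 ASHK tubeless"),
  stringsAsFactors = FALSE
)
df_b <- data.frame(
  product = c("195/75 R16C lid CORDIANT Business CA",
              "The tire `185/75 R16\u0421` \u0421-156"),
  stringsAsFactors = FALSE
)

df_a$size_key <- extract_size_key(df_a$product)
df_b$size_key <- extract_size_key(df_b$product)

# Exact join on the extracted key
merged <- merge(df_a, df_b, by = "size_key", suffixes = c("_a", "_b"))
print(merged)
```

A key like this only identifies the tire size, so two different products of the same size would still collide; it is better suited as a blocking step ahead of a fuzzy comparison than as a complete match rule.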

New question on this topic: R: Error in compare.linkage : Data sets have different format

Answer 1

Here is a solution using the RecordLinkage package. I think it does what you need.

Example data:

library(tidyverse)
library(RecordLinkage)

df_01 <- tibble(
  product = c("Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season",
              "Something else")
)
df_02 <- tibble(
  product = c("195/75 R16C lid CORDIANT Business CA", 
              "Different Product")
)

The details of the next part are best left to the RecordLinkage documentation:

rpairs_jar <- compare.linkage(df_01, df_02,
                              strcmp = c("product"),
                              strcmpfun = jarowinkler)

rpairs_epiwt <- epiWeights(rpairs_jar)

getPairs(rpairs_epiwt, max.weight = Inf, min.weight = -Inf)

   id                                                      product    Weight
1   1 Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season          
2   1                         195/75 R16C lid CORDIANT Business CA 0.6135377
3                                                                           
4   2                                               Something else          
5   2                                            Different Product 0.4827264
6                                                                           
7   1 Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season          
8   2                                            Different Product 0.4586156
9                                                                           
10  2                                               Something else          
11  1                         195/75 R16C lid CORDIANT Business CA 0.4320106

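This yields, for every pair of rows, a weight that estimates how likely the two rows are to match. As you can see, the pair you want to match gets the highest weight.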
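
To turn the weights into an actual set of matches, one possibility is to pass a cutoff to `getPairs` and ask for one row per pair. This is a rough sketch rather than a recommendation: the cutoff of 0.5 is an assumption picked to sit between the top weight and the others in the output above, and the data frames are rebuilt as plain `data.frame`s (`df_a`, `df_b`, names chosen here) so the block runs on its own.

```r
library(RecordLinkage)

df_a <- data.frame(
  product = c("Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season",
              "Something else"),
  stringsAsFactors = FALSE
)
df_b <- data.frame(
  product = c("195/75 R16C lid CORDIANT Business CA",
              "Different Product"),
  stringsAsFactors = FALSE
)

# Same comparison as above: Jaro-Winkler similarity on the product column
rpairs <- compare.linkage(df_a, df_b,
                          strcmp = c("product"),
                          strcmpfun = jarowinkler)
rpairs <- epiWeights(rpairs)

# Assumed cutoff; tune it by inspecting the weights on your own data
cutoff <- 0.5

# single.rows = TRUE returns each candidate pair on one line
likely <- getPairs(rpairs, min.weight = cutoff, single.rows = TRUE)
print(likely)
```

On real data the right cutoff depends on how noisy the product names are, so it is worth looking at the pairs just above and just below it before trusting the result.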
