How to get the nearest matching string, along with its score, from a column in another table?

I am trying to get the nearest matching string, along with its score, using the "stringdist" package with method = "jw" (Jaro-Winkler).

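The first data frame (df_1) has 2 columns. For each string in str_1, I want to get the nearest string from str_2 in df_2, together with a score for that match.
I went through the package and found some solutions, which I show below: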

    year = c(2001,2001,2002,2003,2005,2006)
    str_1 =c("The best ever Puma wishlist","I finalised on buy a top from Myntra","Its perfect for a day at gym",
             "Check out PUMA Unisex Black Running","i have been mailing my issue daily","xyz")
    
    df_1 = data.frame(year,str_1)
    
    ID = c(100,211,155,367,678,2356,927,829,397)
    str_2 = c("VeRy4G3c7X","i have been mailing my issue twice","I finalised on buy a top from jobong",
              "Qm9dZzJdms","Check out PUMA Unisex Black Running","The best Puma wishlist","Its a day at gym",
              "fOpBRWCdSh","")

    df_2 = data.frame(ID,str_2)

I need to get the nearest match from the str_2 column of df_2, and the final table should look like this:

    stringdist(a, b, method = "jw")

    df_1$Nearest_matching = c("The best Puma wishlist","I finalised on buy a top from jobong","Its a day at gym","Check out PUMA Unisex Black Running","i have been mailing my issue twice",NA)
    df_1$Nearest_matching_score = c(0.099,0.092,0.205,0,0.078,NA)

Here is one way to find the nearest match and its score for each value in df_1$str_1:

    library(dplyr)
    library(purrr)
    library(stringdist)

    result <- bind_cols(df_1, map_df(df_1$str_1, function(x) {
      # Jaro-Winkler distance from x to every candidate in df_2$str_2
      vals <- stringdist(x, df_2$str_2, method = 'jw')
      # The nearest match is the candidate with the smallest distance,
      # so the score must be min(vals), not max(vals)
      data.frame(Nearest_matching = df_2$str_2[which.min(vals)],
                 Nearest_matching_score = min(vals))
    }))

    #  year                                str_1
    #1 2001          The best ever Puma wishlist
    #2 2001 I finalised on buy a top from Myntra
    #3 2002         Its perfect for a day at gym
    #4 2003  Check out PUMA Unisex Black Running
    #5 2005   i have been mailing my issue daily
    #6 2006                                  xyz

    #                      Nearest_matching Nearest_matching_score
    #1               The best Puma wishlist             0.09960718
    #2 I finalised on buy a top from jobong             0.09259259
    #3                     Its a day at gym             0.20535714
    #4  Check out PUMA Unisex Black Running             0.00000000
    #5   i have been mailing my issue twice             0.07843137
    #6                           VeRy4G3c7X             0.52222222
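
Note that this approach always returns some match, so "xyz" is paired with an unrelated string instead of the NA shown in the desired table. A minimal sketch of one way to get NA, assuming a cut-off of 0.3 (an illustrative value, not from the original post), is to use amatch() from stringdist, which returns NA when no candidate lies within maxDist:

```r
library(stringdist)

# Strings from the question
str_1 <- c("The best ever Puma wishlist", "I finalised on buy a top from Myntra",
           "Its perfect for a day at gym", "Check out PUMA Unisex Black Running",
           "i have been mailing my issue daily", "xyz")
str_2 <- c("VeRy4G3c7X", "i have been mailing my issue twice",
           "I finalised on buy a top from jobong", "Qm9dZzJdms",
           "Check out PUMA Unisex Black Running", "The best Puma wishlist",
           "Its a day at gym", "fOpBRWCdSh", "")

max_dist <- 0.3  # assumed cut-off; tune it for your own data

# Index of the closest str_2 element within max_dist, or NA if none qualifies
idx <- amatch(str_1, str_2, method = "jw", maxDist = max_dist)

matched <- data.frame(
  str_1 = str_1,
  Nearest_matching = str_2[idx],
  # stringdist() returns NA when either argument is NA
  Nearest_matching_score = stringdist(str_1, str_2[idx], method = "jw"),
  stringsAsFactors = FALSE
)
```

With the default p = 0, method = "jw" computes the plain Jaro distance; passing a p of up to 0.25 adds the Winkler prefix adjustment.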

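This is what I came up with based on the stringdist package documentation:

First I created a distance matrix between str_1 and str_2, then assigned column names to it like this:

    nearest_matching <- stringdistmatrix(df_1$str_1, df_2$str_2, method = "jw")
    colnames(nearest_matching) <- df_2$str_2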

Then I selected the minimum value (distance) from each row.

    apply(nearest_matching, 1, FUN = min)

Output:

    > apply(nearest_matching, 1, FUN = min)
    [1] 0.09960718 0.09259259 0.20535714 0.00000000 0.07843137 0.52222222

Finally, I printed the column names corresponding to those values:

    colnames(nearest_matching)[apply(nearest_matching, 1, FUN = which.min)]

Output:

    > colnames(nearest_matching)[apply(nearest_matching, 1, FUN = which.min)]
    [1] "The best Puma wishlist"               "I finalised on buy a top from jobong" "Its a day at gym"                    
    [4] "Check out PUMA Unisex Black Running"  "i have been mailing my issue twice"   "VeRy4G3c7X"
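
The two steps above can also be combined to write the results back into df_1. Below is a minimal sketch that reuses the data from the question and computes the row-wise which.min only once. The Nearest_ID column is an illustrative addition, not part of the original question:

```r
library(stringdist)

# Data from the question
df_1 <- data.frame(
  year  = c(2001, 2001, 2002, 2003, 2005, 2006),
  str_1 = c("The best ever Puma wishlist", "I finalised on buy a top from Myntra",
            "Its perfect for a day at gym", "Check out PUMA Unisex Black Running",
            "i have been mailing my issue daily", "xyz"),
  stringsAsFactors = FALSE
)
df_2 <- data.frame(
  ID    = c(100, 211, 155, 367, 678, 2356, 927, 829, 397),
  str_2 = c("VeRy4G3c7X", "i have been mailing my issue twice",
            "I finalised on buy a top from jobong", "Qm9dZzJdms",
            "Check out PUMA Unisex Black Running", "The best Puma wishlist",
            "Its a day at gym", "fOpBRWCdSh", ""),
  stringsAsFactors = FALSE
)

# Rows correspond to df_1$str_1, columns to df_2$str_2
d <- stringdistmatrix(df_1$str_1, df_2$str_2, method = "jw")

# Column index of the smallest distance in each row
idx <- apply(d, 1, which.min)

df_1$Nearest_matching       <- df_2$str_2[idx]
# Matrix indexing by (row, column) pairs picks out each row's minimum
df_1$Nearest_matching_score <- d[cbind(seq_len(nrow(d)), idx)]
# Hypothetical extra column: carry over the ID of the matched row
df_1$Nearest_ID             <- df_2$ID[idx]
```

Indexing df_2 by idx, rather than looking the strings up by column name, also makes it easy to bring along any other column of the matched row.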