How to get the nearest matching string, along with its score, from a column in another table?
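I am trying to get the nearest matching string, along with its score, using the "stringdist" package with method = "jw" (Jaro-Winkler).
The first data frame (df_1) contains 2 columns. For each value of str_1, I want to get the nearest string from str_2 in df_2 and a score for that match.
I went through the package and found some solutions, which I show below: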
year = c(2001,2001,2002,2003,2005,2006)
str_1 =c("The best ever Puma wishlist","I finalised on buy a top from Myntra","Its perfect for a day at gym",
"Check out PUMA Unisex Black Running","i have been mailing my issue daily","xyz")
df_1 = data.frame(year,str_1)
ID = c(100,211,155,367,678,2356,927,829,397)
str_2 = c("VeRy4G3c7X","i have been mailing my issue twice","I finalised on buy a top from jobong",
"Qm9dZzJdms","Check out PUMA Unisex Black Running","The best Puma wishlist","Its a day at gym",
"fOpBRWCdSh","")
df_2 = data.frame(ID,str_2)
I need to get the closest match from the str_2 column of df_2, and the final table should look like this:
stringdist(a, b, method = "jw")
df_1$Nearest_matching = c("The best Puma wishlist","I finalised on buy a top from jobong","Its a day at gym","Check out PUMA Unisex Black Running","i have been mailing my issue twice",NA)
df_1$Nearest_matching_score = c(0.099,0.092,0.205,0,0.078,NA)
Here is a way to find the closest match and its score for each value in df_1$str_1.
library(dplyr)
library(purrr)
library(stringdist)
result <- bind_cols(df_1, map_df(df_1$str_1, function(x) {
  # distance from x to every candidate in df_2$str_2
  vals <- stringdist(x, df_2$str_2, method = 'jw')
  # keep the closest candidate together with its distance;
  # the score must be min(vals) so that it agrees with which.min(vals)
  data.frame(Nearest_matching = df_2$str_2[which.min(vals)],
             Nearest_matching_score = min(vals))
}))
result
#  year                                str_1
#1 2001          The best ever Puma wishlist
#2 2001 I finalised on buy a top from Myntra
#3 2002         Its perfect for a day at gym
#4 2003  Check out PUMA Unisex Black Running
#5 2005   i have been mailing my issue daily
#6 2006                                  xyz
#                      Nearest_matching Nearest_matching_score
#1               The best Puma wishlist             0.09960718
#2 I finalised on buy a top from jobong             0.09259259
#3                     Its a day at gym             0.20535714
#4  Check out PUMA Unisex Black Running             0.00000000
#5   i have been mailing my issue twice             0.07843137
#6                           VeRy4G3c7X             0.52222222
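The expected output in the question has NA for "xyz", but picking the minimum distance always returns some candidate, however poor the match. One way to get NA for weak matches is a distance cutoff. The following is a minimal sketch, not part of the original answer: the cutoff value `max_dist = 0.3` and the reduced `queries`/`candidates` vectors are assumptions chosen for illustration. It relies on `amatch()` from stringdist, which returns the index of the closest candidate within `maxDist`, or NA if there is none.

```r
library(stringdist)

# Hypothetical cutoff: anything farther away than this is treated as "no match".
max_dist <- 0.3

# Reduced example data taken from the question.
queries    <- c("Its perfect for a day at gym", "xyz")
candidates <- c("VeRy4G3c7X", "Its a day at gym", "")

# Index of the closest candidate within max_dist, NA when nothing is close enough.
idx <- amatch(queries, candidates, method = "jw", maxDist = max_dist)

matched <- data.frame(
  str_1                  = queries,
  Nearest_matching       = candidates[idx],
  # stringdist() returns NA when either argument is NA
  Nearest_matching_score = stringdist(queries, candidates[idx], method = "jw")
)
matched
```

Note that with the default `p = 0`, method = "jw" in stringdist computes the plain Jaro distance; passing a prefix factor such as `p = 0.1` gives the Jaro-Winkler variant, which would change the scores and possibly the appropriate cutoff.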
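This is what I came up with based on the documentation of the stringdist package:
First I created a distance matrix between str_1 and str_2, then assigned column names to it like this: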
nearest_matching <- stringdistmatrix(df_1$str_1,df_2$str_2, method = "jw")
colnames(nearest_matching) <- str_2
Then I selected the minimum value (distance) from each row.
apply(nearest_matching, 1, FUN = min)
Output:
> apply(nearest_matching, 1, FUN = min)
[1] 0.09960718 0.09259259 0.20535714 0.00000000 0.07843137 0.52222222
Finally, I wrote out the column names corresponding to those values:
colnames(nearest_matching)[apply(nearest_matching, 1, FUN = which.min)]
Output:
> colnames(nearest_matching)[apply(nearest_matching, 1, FUN = which.min)]
[1] "The best Puma wishlist" "I finalised on buy a top from jobong" "Its a day at gym"
[4] "Check out PUMA Unisex Black Running" "i have been mailing my issue twice" "VeRy4G3c7X"
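The matrix steps above can also be folded back into the data frame so that the match and its score end up as columns, as the question asks. This is a sketch under assumptions: it uses a reduced copy of the data, and the names `d` and `best` are introduced here for illustration only. It computes `which.min` once per row and reuses that index for both the string and the score.

```r
library(stringdist)

# Reduced example data taken from the question.
df_1 <- data.frame(str_1 = c("The best ever Puma wishlist", "xyz"))
df_2 <- data.frame(str_2 = c("VeRy4G3c7X", "The best Puma wishlist", ""))

# Rows correspond to df_1$str_1, columns to df_2$str_2.
d <- stringdistmatrix(df_1$str_1, df_2$str_2, method = "jw")

# Column index of the smallest distance in each row.
best <- apply(d, 1, which.min)

df_1$Nearest_matching       <- df_2$str_2[best]
# Matrix indexing with (row, column) pairs picks one distance per row.
df_1$Nearest_matching_score <- d[cbind(seq_len(nrow(d)), best)]
df_1
```

Indexing the matrix with `cbind(row, column)` avoids a second pass with `apply(..., min)` and guarantees that the reported score belongs to the reported string.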