R data.table - quick comparison of strings

Question

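I am looking for a fast solution to the following problem. The example is small, but the real data is large, so speed is an important factor.

I have two vectors of strings, currently stored in data.tables, though that is not important. For each string in the first vector, I need to find how often it occurs in the strings of the second vector, and keep those results.

Example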

library(data.table)

dt1<-data.table(c("first","second","third"),c(0,0,0))
dt2<-data.table(c("first and second","third and fifth","second and no else","first and second and third"))

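Now, for each item in dt1, I need to find the number of items in dt2 that contain it, and save the resulting frequency into the second column of dt1. The task itself is not difficult. However, I have failed to find a reasonably fast solution.

My current solution is this: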

pm<-proc.time()
for (l in 1:dim(dt2)[1]) {
    for (k in 1:dim(dt1)[1]) set(dt1,k,2L,dt1[k,V2]+as.integer(grepl(dt1[k,V1],dt2[l,V1])))
}
proc.time() - pm

The real data is very large and this is slow. On my computer, even this slightly larger version takes about 2 seconds:

dt1<-data.table(rep(c("first","second","third"),10),rep(c(0,0,0),10))
dt2<-data.table(rep(c("first and second","third and fifth","second and no else","first and second and third"),10))

pm<-proc.time()
for (l in 1:dim(dt2)[1]) {
    for (k in 1:dim(dt1)[1]) set(dt1,k,2L,dt1[k,V2]+as.integer(grepl(dt1[k,V1],dt2[l,V1])))
}
proc.time() - pm

   user  system elapsed 
   1.93    0.06    2.06

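Am I missing a better solution for what I would call a very simple task? It really is simple, and I am sure it must be a duplicate, but I did not manage to find it, or anything equivalent, here.

Due to memory constraints (in the real case), cross-joining the data.tables is not possible. Thanks.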

Answer 1

dt1[, V2 := sapply(V1, function(x) sum(grepl(x, dt2$V1)))]
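As a rough check of what this one-liner computes, here is a self-contained sketch on the sample data from the question. It relies on V1 and V2 being the default names data.table gives to unnamed columns; grepl is vectorized over dt2$V1, so the inner loop from the question disappears.

```r
library(data.table)

# Sample data from the question (columns default to V1, V2)
dt1 <- data.table(c("first", "second", "third"), c(0, 0, 0))
dt2 <- data.table(c("first and second", "third and fifth",
                    "second and no else", "first and second and third"))

# For each pattern in dt1$V1, grepl() scans all of dt2$V1 in one call;
# sum() then counts how many of those strings matched.
dt1[, V2 := sapply(V1, function(x) sum(grepl(x, dt2$V1)))]

print(dt1)  # V2 holds 2, 3, 2 for "first", "second", "third"
```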

You can also use fixed-string matching to improve speed. In that case, you can use stri_detect_fixed from the stringi package:

library(stringi)
dt1[, V2 := sapply(V1, function(x) sum(stri_detect_fixed(dt2$V1, x)))]
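If adding a dependency is undesirable, base R's grepl also accepts fixed = TRUE, which treats the pattern as a literal string rather than a regular expression. The following is a minimal sketch of that variant, not part of the original answer; count_matches is a hypothetical helper name, and whether it is faster than the stringi version would need to be timed on the real data.

```r
library(data.table)

# Sample data from the question (columns default to V1, V2)
dt1 <- data.table(c("first", "second", "third"), c(0, 0, 0))
dt2 <- data.table(c("first and second", "third and fifth",
                    "second and no else", "first and second and third"))

# Hypothetical helper: for each pattern, count the texts containing it
# as a literal substring (fixed = TRUE skips regex interpretation).
count_matches <- function(patterns, texts) {
  vapply(patterns,
         function(p) sum(grepl(p, texts, fixed = TRUE)),
         integer(1), USE.NAMES = FALSE)
}

dt1[, V2 := count_matches(V1, dt2$V1)]
print(dt1)
```

vapply is used here instead of sapply only so that the return type is checked to be one integer per pattern.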
