Count Occurrences of Substrings in One Column in Second Column R data.table

Question

How can I split column a (on the space character) and count how many times any of the substrings from column a occur in column b?

library(data.table)
library(stringr)

dt = data.table(
    a = c('one', 'one two', 'one two three'),
    b = c('zero', 'none_or One?' , 'onetwothree')
)

My attempt failed:

dt[ , .(
    str_count( b,
        pattern = str_split( a , pattern = ' ' )
    )
) ]

Error in UseMethod("type") : 
  no applicable method for 'type' applied to an object of class "list"

I expected:

Answer 1

dict = paste0(unique(unlist(strsplit(dt$a, " "))), collapse="|")
f <- function(s,dict) {
  res = gregexpr(dict,s)[[1]]
  return(ifelse(res[1]==-1,as.integer(0),length(res)))
}
dt[,f(b, dict), by=.(1:nrow(dt))][,.(V1)]


   V1
1:  0
2:  1
3:  3

Answer 2

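The first step is to build a regular expression for each row, and then apply that regular expression to the matching value in b.

Try this: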

dt[, parsed := str_split(a, " ")]
dt[, regex  := lapply(parsed, function(x) paste0(x, collapse = "|"))]
dt[, V1     := mapply(function(x, y) {str_extract_all(x, y)[[1]] |> length()}, b, regex)]
dt[, .(V1)]
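
The two steps described above can also be sketched in base R, without the intermediate columns. This is only an illustration under some assumptions: the helper name `count_alternation_matches` is hypothetical, the words in a are assumed to contain no regex metacharacters, and every value of a is assumed to be a non-empty, non-NA string. Note that it counts every match occurrence, so a word that appears twice in b is counted twice.

```r
# Hypothetical helper: for each row, join the words of `a` into an
# alternation pattern ("one|two|...") and count its matches in `b`.
count_alternation_matches <- function(a, b) {
  # Step 1: one regex per row, built from the space-separated words
  patterns <- vapply(
    strsplit(a, " ", fixed = TRUE),
    function(words) paste(words, collapse = "|"),
    character(1)
  )
  # Step 2: apply each row's regex to the corresponding text
  mapply(function(pattern, text) {
    hits <- gregexpr(pattern, text)[[1]]
    # gregexpr() returns -1 when there is no match at all
    if (hits[1] == -1L) 0L else length(hits)
  }, patterns, b, USE.NAMES = FALSE)
}

a <- c("one", "one two", "one two three")
b <- c("zero", "none_or One?", "onetwothree")
counts <- count_alternation_matches(a, b)
print(counts)
```

With the table from the question, the helper could be called as `dt[, V1 := count_alternation_matches(a, b)]`.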

Answer 3

A few more variants:

With data.table only:

dt[, mapply(\(sp,txt) sum(sapply(sp, \(x) grepl(x, txt))), strsplit(a, " "), b) ]
##[1] 0 1 3

With data.table and stringr:

dt[, mapply(\(sp,txt) sum(sapply(sp, \(x) str_detect(txt, x))), strsplit(a, " "), b) ]
##[1] 0 1 3
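
Both variants treat each word as a regular expression, so a word such as `a.b` would also match `axb`. If the words should be matched literally, a possible sketch is to pass `fixed = TRUE` to `grepl()`. The helper name `count_words_present` is hypothetical, and the sketch assumes that counting each word at most once per row (as the variants above do) is the desired behavior.

```r
# Hypothetical helper: for each row, count how many of the
# space-separated words in `a` occur literally somewhere in `b`.
count_words_present <- function(a, b) {
  mapply(function(words, text) {
    # fixed = TRUE disables regex interpretation of each word
    sum(vapply(words, function(w) grepl(w, text, fixed = TRUE), logical(1)))
  }, strsplit(a, " ", fixed = TRUE), b, USE.NAMES = FALSE)
}

a <- c("one", "one two", "one two three")
b <- c("zero", "none_or One?", "onetwothree")
present <- count_words_present(a, b)
print(present)
```

Inside a data.table call this could be used as `dt[, count_words_present(a, b)]`.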

Answer 4

I came up with this solution:

dt[, sum(sapply(tstrsplit(a, " "), function(x) grepl(x, b))), .(a, b)]
