如何按组计算多个字符串出现在另一个字符串列表中的数量?

How to count the number multiple strings appear in another string list by group?

现在我有两个数据,nametext,我想计算每个名字在[=中出现的次数text中的当年13=]name,即生成数据result。如何做到这一点?我尝试了 lapplygrepl,但都失败了。非常感谢!

name=data.table(year=c(2018,2019,2020),
                  name0=list(c("A","B","C"),c("B","C"),c("D","E","F")))
text=data.table(year=c(2018,2018,2019,2019,2020),
                text0=list(c("DEF","BG","CG"),c("ART","CWW"),c("DLK","BU","FO"),
                           c("A45","11B","C23"),c("EIU","CM")))
result=data.table(year=c(2018,2018,2018,2019,2019,2020,2020,2020),
                 name0=c("A","B","C","B","C","D","E","F"),
                 count=c(1,1,2,2,1,0,1,0))

合并未列出的值将起作用:

library(data.table)
merge(
  name[, .(name0 = unlist(name0)), by = .(year)],
  text[, .(name0 = unlist(strsplit(unlist(text0), ""))), by=.(year)][, ign := 1],
  by = c("year", "name0"), all.x = TRUE, allow.cartesian = TRUE
)[,.(count = sum(!is.na(ign))), by = .(year, name0)]
#     year  name0 count
#    <num> <char> <int>
# 1:  2018      A     1
# 2:  2018      B     1
# 3:  2018      C     2
# 4:  2019      B     2
# 5:  2019      C     1
# 6:  2020      D     0
# 7:  2020      E     1
# 8:  2020      F     0

ign 变量使我们可以强制 all.x=TRUE 但考虑到那些在 y.

中找不到的变量

速度较慢但可能更节省内存的方法:

namelong <- name[, .(name0 = unlist(name0)), by = .(year)]
namelong
#     year  name0
#    <num> <char>
# 1:  2018      A
# 2:  2018      B
# 3:  2018      C
# 4:  2019      B
# 5:  2019      C
# 6:  2020      D
# 7:  2020      E
# 8:  2020      F

func <- function(yr, nm) text[year == yr, sum(grepl(nm, unlist(text0)))]
namelong[, count := do.call(mapply, c(list(FUN=func), unname(namelong)))]
#     year  name0 count
#    <num> <char> <int>
# 1:  2018      A     1
# 2:  2018      B     1
# 3:  2018      C     2
# 4:  2019      B     2
# 5:  2019      C     1
# 6:  2020      D     0
# 7:  2020      E     1
# 8:  2020      F     0