How to count, by group, the number of times multiple strings appear in another list of strings?
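I have two data tables, name and text. For each name in name, I want to count how many times it appears in text in the same year, i.e. produce the table result. How can I do this? I tried lapply and grepl, but both failed. Thanks very much!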
library(data.table)
# names to look for, one list per year
name=data.table(year=c(2018,2019,2020),
                name0=list(c("A","B","C"),c("B","C"),c("D","E","F")))
# strings to search in, several rows per year
text=data.table(year=c(2018,2018,2019,2019,2020),
                text0=list(c("DEF","BG","CG"),c("ART","CWW"),c("DLK","BU","FO"),
                           c("A45","11B","C23"),c("EIU","CM")))
# desired output
result=data.table(year=c(2018,2018,2018,2019,2019,2020,2020,2020),
                  name0=c("A","B","C","B","C","D","E","F"),
                  count=c(1,1,2,2,1,0,1,0))
Merging the unlisted values will work:
library(data.table)
merge(
  # one row per (year, name)
  name[, .(name0 = unlist(name0)), by = .(year)],
  # one row per (year, single character of text0); ign marks rows that come from text
  text[, .(name0 = unlist(strsplit(unlist(text0), ""))), by = .(year)][, ign := 1],
  by = c("year", "name0"), all.x = TRUE, allow.cartesian = TRUE
)[, .(count = sum(!is.na(ign))), by = .(year, name0)]
# year name0 count
# <num> <char> <int>
# 1: 2018 A 1
# 2: 2018 B 1
# 3: 2018 C 2
# 4: 2019 B 2
# 5: 2019 C 1
# 6: 2020 D 0
# 7: 2020 E 1
# 8: 2020 F 0
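The ign variable lets us force all.x=TRUE while still accounting for the names that are not found in y (the second table passed to merge): those rows come back with ign set to NA, so they are counted as 0.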
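To see why the ign column matters, here is a minimal sketch on toy data. The tables x and y and the names naive and marked are made up for illustration and are not part of the question. It compares counting rows with .N against counting non-missing markers after the same all.x = TRUE merge.

```r
library(data.table)

# Toy data (hypothetical): one year, two names to look for.
x <- data.table(year = 2020, name0 = c("D", "E"))
# Characters found in the text for that year; "D" never occurs.
y <- data.table(year = 2020, name0 = c("E", "E"))

# Without a marker column, the unmatched "D" still contributes one row,
# so counting rows with .N reports 1 for it instead of 0.
naive <- merge(x, y, by = c("year", "name0"),
               all.x = TRUE, allow.cartesian = TRUE)[
  , .(count = .N), by = .(year, name0)]

# With a marker column, unmatched rows carry NA in ign and drop out of the sum.
y[, ign := 1]
marked <- merge(x, y, by = c("year", "name0"),
                all.x = TRUE, allow.cartesian = TRUE)[
  , .(count = sum(!is.na(ign))), by = .(year, name0)]

print(naive)
print(marked)
```

In this sketch naive reports a count of 1 for "D", while marked reports 0, which is the behaviour the answer relies on.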
A slower but possibly more memory-efficient approach:
namelong <- name[, .(name0 = unlist(name0)), by = .(year)]
namelong
# year name0
# <num> <char>
# 1: 2018 A
# 2: 2018 B
# 3: 2018 C
# 4: 2019 B
# 5: 2019 C
# 6: 2020 D
# 7: 2020 E
# 8: 2020 F
func <- function(yr, nm) text[year == yr, sum(grepl(nm, unlist(text0)))]
namelong[, count := do.call(mapply, c(list(FUN=func), unname(namelong)))]
# year name0 count
# <num> <char> <int>
# 1: 2018 A 1
# 2: 2018 B 1
# 3: 2018 C 2
# 4: 2019 B 2
# 5: 2019 C 1
# 6: 2020 D 0
# 7: 2020 E 1
# 8: 2020 F 0
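One thing to keep in mind with grepl(nm, ...) is that nm is treated as a regular expression, so a name such as "." or "A+" would be read as a pattern rather than as literal text. Below is a sketch of a variant that matches literally with fixed = TRUE and calls mapply directly on the columns. The helper name count_literal is my own and only illustrative; the data is copied from the question.

```r
library(data.table)

name <- data.table(year = c(2018, 2019, 2020),
                   name0 = list(c("A", "B", "C"), c("B", "C"), c("D", "E", "F")))
text <- data.table(year = c(2018, 2018, 2019, 2019, 2020),
                   text0 = list(c("DEF", "BG", "CG"), c("ART", "CWW"),
                                c("DLK", "BU", "FO"), c("A45", "11B", "C23"),
                                c("EIU", "CM")))

# One row per (year, name), as in the answer above.
namelong <- name[, .(name0 = unlist(name0)), by = .(year)]

# Hypothetical helper: number of strings in the given year that contain nm,
# with nm matched as literal text rather than as a regular expression.
count_literal <- function(yr, nm) {
  strings <- unlist(text[year == yr, text0])
  sum(grepl(nm, strings, fixed = TRUE))
}

namelong[, count := mapply(count_literal, year, name0)]
print(namelong)
```

Note that, like the original func, this counts the strings that contain the name, not the individual occurrences. A string such as "CC" would add 1 here, whereas the strsplit-based merge would add 2. Unlike the merge, it should also work for names longer than one character.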