Wrong output when counting words in multiple texts

Question

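I have 2 datasets. One contains 500 distinct entities for which several variables were measured. The other contains 500 texts, each of which belongs to one entity in the first dataset. I want to search these texts for 3 keywords and count the total number of keyword occurrences in each text.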

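Here is some made-up data for illustration: keywords is a vector, texts is a list containing the texts (my real data is a list; I am not sure my example list here is built correctly), and df is the data frame with my entity variables: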

keywords <- c("ab", "cd", "ef")
texts <- list("ab is ef when ef is ef",
              "something something nothing",
              "cd is cd is ab is ab and ef")
var1 <- c("area1", "area2", "area3")
var2 <- c("15", "5", "23")
df <- data.frame(var1, var2)
colnames(df) <- c("location", "temperature")

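The correct answer here is that the keywords occur 4 times in the first text, 0 times in the second, and 5 times in the third. However, when I try the following, it gives the wrong output: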

df$count <- 0 # Store the results
# counting for all keywords
for (w in keywords) {
  df$count <-
    df$count +
    grepl(w, texts, ignore.case = TRUE)
  print(w)
}

df$count

Any hints on what I could do, preferably with some example code?

Thanks in advance

Answer 1

Your texts is a list. Is there a reason for that? If not, make it a character vector instead.
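If the real data already arrives as a list, one way to get a character vector is unlist(). This is only a sketch, and it assumes every list element is a single string; texts_list is a name made up for the illustration.

```r
# Hypothetical starting point: the texts stored as a list of single strings
texts_list <- list("ab is ef when ef is ef",
                   "something something nothing",
                   "cd is cd is ab is ab and ef")

# Flatten the list into a plain character vector with one element per text
texts <- unlist(texts_list)

is.character(texts)  # TRUE
length(texts)        # 3
```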

You can also do the counting more easily. Maybe try the stringr package. Then you can do

library(stringr)

keywords <- c("ab", "cd", "ef")
texts <- c("ab is ef when ef is ef",
           "something something nothing",
           "cd is cd is ab is ab and ef")

str_count(texts, "ab|cd|ef")

[1] 4 0 5
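For comparison, the loop in the question appears to undercount because grepl() only reports whether a pattern matches a text at all, not how often. Adding up its TRUE/FALSE results therefore gives the number of distinct keywords found per text. The base R sketch below shows the difference; count_matches is a helper name invented for this example.

```r
keywords <- c("ab", "cd", "ef")
texts <- c("ab is ef when ef is ef",
           "something something nothing",
           "cd is cd is ab is ab and ef")

# grepl() yields one TRUE/FALSE per text, so summing over the keywords
# counts how many different keywords appear in each text
present <- sapply(keywords, function(w) grepl(w, texts, ignore.case = TRUE))
keywords_present <- rowSums(present)

# gregexpr() yields the start position of every match (-1 if there is none),
# so counting the positive positions gives the number of occurrences
count_matches <- function(w) {
  sapply(gregexpr(w, texts, ignore.case = TRUE), function(m) sum(m > 0))
}
occurrences <- rowSums(sapply(keywords, count_matches))

keywords_present  # 2 0 3, what the loop in the question produces
occurrences       # 4 0 5, the counts that were wanted
```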

If you can't hard-code the pattern as in the str_count(texts, "ab|cd|ef") call, you can also build it from keywords:

str_count(texts, paste(keywords, collapse = "|"))
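To tie this back to the data frame from the question, the counts can be written straight into a column. The sketch below assumes that row i of df belongs to texts[i], which the question implies but does not state. It also keeps the case-insensitive matching from the original loop by wrapping the pattern in stringr's regex(). The whole-word variant and the count_whole column are additions for illustration only.

```r
library(stringr)

keywords <- c("ab", "cd", "ef")
texts <- c("ab is ef when ef is ef",
           "something something nothing",
           "cd is cd is ab is ab and ef")
df <- data.frame(location = c("area1", "area2", "area3"),
                 temperature = c("15", "5", "23"))

# One alternation pattern built from the keywords: "ab|cd|ef"
pattern <- paste(keywords, collapse = "|")

# Assumption: row i of df corresponds to texts[i]
df$count <- str_count(texts, regex(pattern, ignore_case = TRUE))

# Optional variant: count whole words only, so that "ab" inside a longer
# word such as "cab" is not counted
whole_word <- paste0("\\b(", pattern, ")\\b")
df$count_whole <- str_count(texts, regex(whole_word, ignore_case = TRUE))

df$count  # 4 0 5
```

Note that the keywords are interpreted as regular expressions here, so a keyword containing characters such as . or ( would need escaping first.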
