Count frequency of dictionary words within a column and generate new "dictfreq" column

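This seems like a simple task, but I can't find a good way to do it in R. Basically, I just want to count how often each word in a dictionary, dict, occurs in the column wordsgov of another data frame: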

dict <- c("apple", "pineapple", "pear")
df <- data.frame(wordsgov = c("i hate apple", "i hate apple", "i love pear", "i don't like pear", "pear is okay", "i eat pineapple sometimes"))

Desired output: a new frequency ranking that lists every word in the dictionary, ordered by its frequency in df$wordsgov:

dict          freq_gov
"pear"      : 3
"apple"     : 2
"pineapple" : 1

I tried the code below, but it gives me the number of dictionary words found in each row of df$wordsgov, which is not what I want:

dictongov <- within(
  df,
  counts <- sapply(
    gregexpr(paste0(dict, collapse = "|"), wordsgov),
    function(x) sum(x > 0)
  )
)

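I can't figure out how to change the function so that it gives me the frequency of each dictionary word across df$wordsgov. I tried str_detect, but that didn't work either. Any help would be much appreciated!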

-- Edit: I used the following, which works well.

dictfreq <- df %>% mutate(dict = str_c(str_extract(wordsgov, str_c(dict, collapse = '|')), ':')) %>% 
                   count(dict, name = 'freq_gov') %>% arrange(desc(freq_gov))

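However, it drops every word whose frequency is 0. Is there a way to keep the words with a frequency of 0? I tried ".drop = FALSE", but it doesn't seem to work in this code. Any help would be greatly appreciated. Thanks!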

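We can also use str_count, wrapping each dictionary word in word boundaries so that "apple" does not match inside "pineapple":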

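library(stringr)
library(purrr)
# v1 (the text) and v2 (the dictionary) are defined at the end of this answer.
# "\\b" is the regex word boundary; a single "\b" in an R string is a backspace.
out <- map_int(str_c("\\b", v2, "\\b"), ~ sum(str_count(v1, .x)))
out
#[1] 2 1 3

# rank of each count (1 = least frequent)
rank(out)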

Data

v1 <- c("i hate apple", "i hate apple", "i love pear", "i don't like pear", 
       "pear is okay", "i eat pineapple sometimes")

v2 <- c("apple", "pineapple", "pear")
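
To get the table asked for in the question, including words whose frequency is 0, one option is to count every dictionary word explicitly instead of extracting matches and tabulating them. The following is only a minimal sketch under a few assumptions: it uses base R's gregexpr in place of str_count, the helper name count_word is made up for illustration, "banana" is a hypothetical extra dictionary entry added just to show a zero count, and the dictionary words are assumed to contain no regex metacharacters.

```r
# Sample text from the question
v1 <- c("i hate apple", "i hate apple", "i love pear", "i don't like pear",
        "pear is okay", "i eat pineapple sometimes")

# Dictionary; "banana" is a hypothetical entry that never occurs in v1
v2 <- c("apple", "pineapple", "pear", "banana")

# Hypothetical helper: count whole-word matches of one word across all rows
count_word <- function(word, text) {
  hits <- gregexpr(paste0("\\b", word, "\\b"), text, perl = TRUE)
  # gregexpr returns -1 for a row without a match, so only count positions > 0
  sum(vapply(hits, function(h) sum(h > 0), integer(1)))
}

# One row per dictionary word, so zero counts are kept
dictfreq <- data.frame(
  dict = v2,
  freq_gov = vapply(v2, count_word, integer(1), text = v1, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# Sort from most to least frequent
dictfreq <- dictfreq[order(-dictfreq$freq_gov), ]
rownames(dictfreq) <- NULL
dictfreq
# pear 3, apple 2, pineapple 1, banana 0
```

Because the data frame is built from the dictionary itself rather than from the matches, a word that never appears still gets a row with freq_gov equal to 0.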