Count frequency of dictionary words within a column and generate new "dictfreq" column

Question

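This seems like it should be simple, but I can't find a good way to do it in R. Basically, I just want to count how often each word in a dictionary, dict, occurs in the column wordsgov of another data frame: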

dict <- c("apple", "pineapple", "pear")
df <- data.frame(wordsgov = c("i hate apple", "i hate apple", "i love pear",
                              "i don't like pear", "pear is okay",
                              "i eat pineapple sometimes"))

Desired output: a new frequency ranking that lists every word in the dictionary, ordered by its frequency in df$wordsgov:

dict         freq_gov
"pear":      3
"apple":     2
"pineapple": 1

I tried the code below, but it gives me the number of dictionary words found in each row of df$wordsgov, which is not what I want:

dictongov <- within(
  df,
  counts <- sapply(
    gregexpr(paste0(dict, collapse = "|"), wordsgov),
    function(x) sum(x > 0)
  )
)

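I can't figure out how to change the function so that it gives me the frequency of each dictionary word across df$wordsgov. I tried str_detect, but that did not work either. Any help would be much appreciated!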

-- Edit: I used the following, which works well.

dictfreq <- df %>% mutate(dict = str_c(str_extract(wordsgov, str_c(dict, collapse = '|')), ':')) %>% 
                   count(dict, name = 'freq_gov') %>% arrange(desc(freq_gov))

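However, it drops every word whose frequency is 0. Is there a way to keep the words with a frequency of 0? I tried ".drop=FALSE", but it does not seem to work in this code. Any help would be much appreciated. Thanks!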

Answer 1

We can also use str_count, wrapping each word in word boundaries so that "apple" does not match inside "pineapple":

library(stringr)
library(purrr)
# "\\b" is a regex word boundary; a single backslash ("\b") would be a backspace character in R
out <- map_int(str_c("\\b", v2, "\\b"), ~ sum(str_count(v1, .x)))
out
#[1] 2 1 3

rank(out)

data

v1 <- c("i hate apple", "i hate apple", "i love pear", "i don't like pear", 
       "pear is okay", "i eat pineapple sometimes")

v2 <- c("apple", "pineapple", "pear")
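
To get from these counts to the table asked for in the question, including dictionary words that never occur, one option is to count each word separately and then build the data frame from the full dictionary. The following is only a minimal base R sketch under some assumptions: the helper name count_word is hypothetical, "banana" is added to the dictionary purely to show a zero count, and the dictionary words are assumed to contain no regex metacharacters (they would need escaping otherwise).

```r
# Example data from the question; "banana" is an added, hypothetical entry
# used only to demonstrate that zero-frequency words are kept.
dict <- c("apple", "pineapple", "pear", "banana")
wordsgov <- c("i hate apple", "i hate apple", "i love pear",
              "i don't like pear", "pear is okay",
              "i eat pineapple sometimes")

# Hypothetical helper: total number of whole-word matches of `word` in `text`.
count_word <- function(word, text) {
  hits <- gregexpr(paste0("\\b", word, "\\b"), text, perl = TRUE)
  # gregexpr() returns -1 for elements without a match, so count positions > 0
  sum(vapply(hits, function(m) sum(m > 0), integer(1)))
}

# One row per dictionary word, so words with zero matches are never dropped.
dictfreq <- data.frame(
  dict = dict,
  freq_gov = vapply(dict, count_word, integer(1), text = wordsgov,
                    USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# Sort by descending frequency and reset the row names.
dictfreq <- dictfreq[order(-dictfreq$freq_gov), ]
rownames(dictfreq) <- NULL
print(dictfreq)
```

Because the data frame starts from dict rather than from the matched text, every dictionary word keeps a row; here pear, apple and pineapple come out with 3, 2 and 1, and banana stays in the result with 0.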
