Count the frequency of dictionary words within a column and generate a new "dictfreq" column
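This seems like a simple task, but I can't find a good way to do it in R. Basically, I want to count how often each word in my dictionary, dict, appears in the column wordsgov of another data frame:
dict <- c("apple", "pineapple", "pear")
df <- data.frame(wordsgov = c("i hate apple", "i hate apple", "i love pear", "i don't like pear", "pear is okay", "i eat pineapple sometimes"))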
Desired output: a new frequency ranking that lists every word in dict, ordered by its frequency in df$wordsgov:
dict          freq_gov
"pear":       3
"apple":      2
"pineapple":  1
I tried the code below, but it gives me the number of dictionary words that appear in each row of df$wordsgov, which is not what I want:
dictongov <- within(
  df,
  counts <- sapply(
    gregexpr(paste0(dict, collapse = "|"), wordsgov),
    function(x) sum(x > 0)
  )
)
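I can't figure out how to change the function so that it gives me the frequency of each dictionary word across the whole of df$wordsgov. I tried str_detect, but that didn't work either. Any help would be much appreciated!
--
Edit:
I used the following, and it works well.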
library(dplyr)
library(stringr)
dictfreq <- df %>%
  mutate(dict = str_c(str_extract(wordsgov, str_c(dict, collapse = '|')), ':')) %>%
  count(dict, name = 'freq_gov') %>%
  arrange(desc(freq_gov))
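However, it drops every word with a frequency of 0. Is there a way to keep the words with a frequency of 0? I tried ".drop = FALSE", but it doesn't seem to work in this code. Any help would be greatly appreciated. Thanks!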
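We can also use str_count:
library(stringr)
library(purrr)
# Wrap each word in "\\b" word boundaries so that "apple" is not counted inside "pineapple",
# then sum the per-row counts for each dictionary word.
out <- map_int(str_c("\\b", v2, "\\b"), ~ sum(str_count(v1, .x)))
out
#[1] 2 1 3
rank(out)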
Data
v1 <- c("i hate apple", "i hate apple", "i love pear", "i don't like pear",
        "pear is okay", "i eat pineapple sometimes")
v2 <- c("apple", "pineapple", "pear")
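To get from the counts to the table the question asks for, including words that never occur, one possible approach is to count per dictionary word and then build and sort a data frame. The following is only a minimal sketch in base R (so it runs without extra packages), not the original answer's code. The helper name count_word and the extra dictionary entry "banana" are hypothetical, added here purely to show a zero count, and the sketch assumes the dictionary words contain no regex metacharacters.

```r
# Sample text from the data above.
v1 <- c("i hate apple", "i hate apple", "i love pear", "i don't like pear",
        "pear is okay", "i eat pineapple sometimes")
# "banana" is a hypothetical extra entry, included only to demonstrate a zero count.
dict <- c("apple", "pineapple", "pear", "banana")

# Count whole-word occurrences of one dictionary word across all rows of text.
# Assumes `word` contains no regex metacharacters.
count_word <- function(word, text) {
  pattern <- paste0("\\b", word, "\\b")
  matches <- regmatches(text, gregexpr(pattern, text, perl = TRUE))
  sum(lengths(matches))
}

# One count per dictionary word; words with no match get 0 instead of being dropped.
freq <- vapply(dict, count_word, integer(1), text = v1)

# Assemble the result and sort by descending frequency.
dictfreq <- data.frame(dict = dict, freq_gov = unname(freq),
                       stringsAsFactors = FALSE)
dictfreq <- dictfreq[order(-dictfreq$freq_gov), ]
rownames(dictfreq) <- NULL
dictfreq
```

With this data the rows come out as pear (3), apple (2), pineapple (1), banana (0). Because the loop runs over the dictionary rather than over the matches found in the text, zero-frequency words stay in the result, which is what the extract-then-count pipeline in the question loses.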