Grepl a group of strings and count the frequency of each using R

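I have a column named text containing 50k rows of tweets read from a csv file (the tweets consist of sentences, phrases, etc.). I am trying to count the frequency of several words in that column. Is there a simpler way to do this than what I am doing below?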

# Reading my file
tweets <- read.csv('coffee.csv', header=TRUE)


# Doing a grepl per word (This is hard because I need to look for many words one by one)
coffee <- grepl("coffee", tweets$text, ignore.case=TRUE)
mugs   <- grepl("mugs", tweets$text, ignore.case=TRUE)


# Calculate the % of times among all tweets (This is hard because I need to calculate one by one)

sum(coffee) / nrow(tweets)
sum(mugs) / nrow(tweets)

Expected output (assuming there are more than 2 words above)

Word   Freq
coffee  50
mugs    40
cup     64
pen     12

You can create a vector of the words whose frequency/percentage you want to compute, and then use sapply to calculate them.

words <- c('coffee', 'mugs')

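# '\\b' is a regex word boundary; the backslash must be doubled inside an R string
data.frame(words, t(sapply(paste0('\\b', words, '\\b'), function(x) {
  tmp <- grepl(x, tweets$text)   # TRUE/FALSE per tweet
  c(perc = mean(tmp) * 100, 
    Freq = sum(tmp))
})), row.names = NULL) -> result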
result

#   words     perc Freq
#1 coffee 33.33333    1
#2   mugs 66.66667    2

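sapply is similar to a for loop in that it iterates over each word defined in words. grepl returns TRUE/FALSE values indicating whether the word is present in each element of tweets$text, and these are stored in tmp. We use sum to compute the frequency and mean to compute the percentage. Word boundaries (\\b) are also added around each word so that it matches only as a whole word in text, so 'coffee' does not match 'coffees', etc.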
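To see what the word boundaries change, here is a small sketch on three made-up strings (the vector x is hypothetical, not the asker's data). It also adds ignore.case = TRUE, which the question used but the answer's code above does not, so treat that part as an optional variation.

```r
# Hypothetical sample strings, for illustration only
x <- c("I love coffee", "Two coffees please", "COFFEE time")

# Plain substring match: also hits "coffees", misses upper case
plain <- grepl("coffee", x)

# Whole-word match: "coffees" no longer counts
bounded <- grepl("\\bcoffee\\b", x)

# Whole-word match, ignoring case: "COFFEE" now counts as well
bounded_ci <- grepl("\\bcoffee\\b", x, ignore.case = TRUE)

plain       # TRUE  TRUE FALSE
bounded     # TRUE FALSE FALSE
bounded_ci  # TRUE FALSE  TRUE
```

Whether plurals such as 'coffees' should count is up to you; drop the \\b wrappers if they should.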

Data

tweets <- data.frame(text = c('This is text with coffee in it with lot of mugs', 
                              'This has only mugs', 
                              'This has nothing'))
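
If you need this for many word lists, the same idea can be wrapped in a small function. This is only a sketch: count_words is a hypothetical helper name (not from any package), it uses vapply instead of sapply so the return type is fixed, and it assumes the words contain no regex metacharacters. The column names Word and Freq follow the expected output in the question; Perc is an extra column added here.

```r
# count_words() is a hypothetical helper, not part of base R or any package
count_words <- function(text, words, ignore.case = TRUE) {
  # Wrap each word in word boundaries so only whole words match
  patterns <- paste0("\\b", words, "\\b")
  # For each pattern, count how many elements of text contain it
  hits <- vapply(
    patterns,
    function(p) sum(grepl(p, text, ignore.case = ignore.case)),
    integer(1),
    USE.NAMES = FALSE
  )
  data.frame(Word = words, Freq = hits, Perc = 100 * hits / length(text))
}

# Same sample data as above
tweets <- data.frame(text = c('This is text with coffee in it with lot of mugs',
                              'This has only mugs',
                              'This has nothing'))

res <- count_words(tweets$text, c('coffee', 'mugs', 'cup'))
res
```

On the sample data this gives a Freq of 1 for coffee, 2 for mugs and 0 for cup. Note that Freq here is the number of tweets containing the word, not the total number of occurrences, which is the same definition the sapply version uses.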