Grepl a group of strings and count the frequency of each using R
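I have a column named text with 50k rows of tweets, read from a csv file (the tweets consist of sentences, phrases, etc.). I'm trying to count the frequency of several words in that column. Is there an easier way to do this than what I'm doing below?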
# Reading my file
tweets <- read.csv('coffee.csv', header=TRUE)
# Doing a grepl per word (This is hard because I need to look for many words one by one)
coffee <- grepl("coffee", tweets$text, ignore.case=TRUE)
mugs <- grepl("mugs", tweets$text, ignore.case=TRUE)
# Calculate the % of times among all tweets (This is hard because I need to calculate one by one)
sum(coffee) / nrow(tweets)
sum(mugs) / nrow(tweets)
Expected output (assuming there are more than 2 words above)
Word Freq
coffee 50
mugs 40
cup 64
pen 12
You can create a vector of the words whose frequency/percentage you want to compute, then use sapply to count them.
words <- c('coffee', 'mugs')
result <- data.frame(words, t(sapply(paste0('\\b', words, '\\b'), function(x) {
  # TRUE/FALSE per tweet: does it contain the word as a whole word?
  tmp <- grepl(x, tweets$text)
  c(perc = mean(tmp) * 100,
    Freq = sum(tmp))
})), row.names = NULL)
result
# words perc Freq
#1 coffee 33.33333 1
#2 mugs 66.66667 2
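sapply is similar to a for loop in that it iterates over each word defined in words. grepl returns TRUE/FALSE values indicating whether the word is present in each element of tweets$text, and these are stored in tmp. We use sum to compute the frequency (the number of tweets containing the word) and mean to compute the percentage. Word boundaries (\b, written as '\\b' inside an R string) are also added around each word so that it matches exactly in text, so 'coffee' does not match 'coffees', etc.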
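To illustrate the word-boundary point, here is a small sketch. The vector x and its three strings are made up for this example and are not from the question's data.

```r
# Hypothetical sample strings, only for illustration
x <- c("I like coffee", "Two coffees please", "COFFEE time")

# Without boundaries, "coffee" also matches inside "coffees"
plain <- grepl("coffee", x)

# With \\b on both sides, only the whole word matches
bounded <- grepl("\\bcoffee\\b", x)

# ignore.case = TRUE additionally picks up "COFFEE"
bounded_ci <- grepl("\\bcoffee\\b", x, ignore.case = TRUE)

print(rbind(plain, bounded, bounded_ci))
```

Whether to pass ignore.case = TRUE depends on the data; the question's own code uses it, so it is probably what you want for tweets.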
Data
tweets <- data.frame(text = c('This is text with coffee in it with lot of mugs',
'This has only mugs',
'This has nothing'))
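If you want the result shaped like the expected Word/Freq table and matched case-insensitively as in the question, the same idea could be wrapped in a helper. This is only a sketch: count_words is a name invented here, and it uses vapply instead of sapply so that the result type is fixed.

```r
# Same sample data as above
tweets <- data.frame(text = c('This is text with coffee in it with lot of mugs',
                              'This has only mugs',
                              'This has nothing'),
                     stringsAsFactors = FALSE)

# count_words is a hypothetical helper, not part of base R
count_words <- function(text, words) {
  # Wrap each word in word boundaries so 'coffee' does not match 'coffees'
  patterns <- paste0("\\b", words, "\\b")
  # Number of elements of text containing each word, ignoring case
  freq <- vapply(patterns,
                 function(p) sum(grepl(p, text, ignore.case = TRUE)),
                 integer(1))
  data.frame(Word = words,
             Freq = unname(freq),
             Perc = 100 * unname(freq) / length(text))
}

result <- count_words(tweets$text, c("coffee", "mugs"))
print(result)
```

Note that Freq counts tweets containing the word at least once, not the total number of occurrences across all tweets.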