Calculating word frequency for multi-words in R?
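I am trying to calculate the frequency of multi-word terms in a given text. For example, consider the text: "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology". I want the number of times the combined term "environmental research" occurs in the text. Here is the code I tried.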
library(tm)
#Reading the data
text = readLines(file.choose())
text1 = Corpus(VectorSource(text))
#Cleaning the data
text1 = tm_map(text1, content_transformer(tolower))
text1 = tm_map(text1, removePunctuation)
text1 = tm_map(text1, removeNumbers)
text1 = tm_map(text1, stripWhitespace)
text1 = tm_map(text1, removeWords, stopwords("english"))
#Making a document matrix
dtm = TermDocumentMatrix(text1)
m11 = as.matrix(dtm)
freq11 = sort(rowSums(m11), decreasing=TRUE)
d11 = data.frame(word=names(freq11), freq=freq11)
head(d11,9)
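However, this code produces the frequency of each word separately. How do I instead get the number of times "environmental research" occurs together in the text? Thanks!
If you already have a list of multi-word terms and want to count their frequency in a text, you can use str_extract_all: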
text <- "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology"
library(stringr)
str_extract_all(text, "[Ee]nvironmental [Rr]esearch")
[[1]]
[1] "Environmental Research" "Environmental Research" "Environmental Research"
If you want to know how often the multi-word term occurs, you can do this:
length(unlist(str_extract_all(text, "[Ee]nvironmental [Rr]esearch")))
[1] 3
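As a possible shortcut, assuming you only need the count and not the matched strings themselves, stringr's str_count returns the number of matches directly. The sketch below also swaps the [Ee]/[Rr] character classes for regex(..., ignore_case = TRUE); the variable name n_hits is only illustrative.

```r
library(stringr)

text <- "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology"

# Count the matches directly instead of extracting them and taking the length.
# ignore_case = TRUE makes the match case-insensitive, so no [Ee]/[Rr] classes are needed.
n_hits <- str_count(text, regex("environmental research", ignore_case = TRUE))
n_hits
```

If text is a vector of several documents, str_count returns one count per element, so sum() gives the total.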
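If you are interested in extracting all multi-word terms at once, you can proceed like this:
First, define a vector with all the multi-word terms: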
multiwords <- c("[Ee]nvironmental [Rr]esearch", "study science energy")
Then use paste0 to collapse them into a single string of alternative patterns and use str_extract_all on that string:
str_extract_all(text, paste0(multiwords, collapse = "|"))
[[1]]
[1] "Environmental Research" "Environmental Research" "Environmental Research" "study science energy"
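One thing to watch with a collapsed alternation is partial matches inside longer words (for example, a pattern ending in "energy" would also match the start of "energyefficient"). A minimal sketch of one way to guard against that, assuming the terms should only match as whole words, is to wrap the alternation in word boundaries; the names pattern and matches are illustrative.

```r
library(stringr)

text <- "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology"
multiwords <- c("[Ee]nvironmental [Rr]esearch", "study science energy")

# Group the alternatives and anchor the group with \\b on both sides,
# so each term must start and end at a word boundary.
pattern <- paste0("\\b(", paste0(multiwords, collapse = "|"), ")\\b")

# str_extract_all returns a list with one element per input string
matches <- str_extract_all(text, pattern)[[1]]
matches
```

For this text the result is the same as without the boundaries; the difference only shows up when a term happens to be a prefix or suffix of a longer word.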
To get the frequencies of the multi-word terms, you can use table:
table(str_extract_all(text, paste0(multiwords, collapse = "|")))
Environmental Research   study science energy
                     3                      1
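Note that table only lists terms that actually occur, and it treats differently capitalised matches as separate entries. If you would rather have one count per term, including zeros, a rough alternative is to call str_count once per term. This is only a sketch: the phrases vector (including the non-occurring "marine biology") is a made-up example, and each phrase is still interpreted as a regular expression, so special characters would need escaping.

```r
library(stringr)

text <- "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology"

# Hypothetical list of terms to look for; the last one does not occur in the text
phrases <- c("environmental research", "study science energy", "marine biology")

# One case-insensitive str_count call per phrase.
# vapply names the result after the phrases and guarantees an integer vector.
counts <- vapply(
  phrases,
  function(p) str_count(text, regex(p, ignore_case = TRUE)),
  integer(1)
)
counts
```

The result is a named integer vector, so it can be turned into a data frame with data.frame(term = names(counts), freq = counts) if that format is more convenient.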