
In R, how can I count specific words in a corpus?

I need to count how often specific words occur, and there are many of them. I know how to do this by putting all the words into a single group (see below), but I want to get a count for each individual word.

This is what I have so far:

library(quanteda)

# helper to count matches of a pattern per string (not used below)
strcount <- function(x, pattern, split) {
  unlist(lapply(strsplit(x, split), function(z) na.omit(length(grep(pattern, z)))))
}

txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
df <- data.frame(txt)
mydict <- dictionary(list(all_terms = c("clouds", "storms")))
corp <- corpus(df, text_field = "txt")

# count dictionary terms and save the output to "overview"
overview <- dfm(corp, dictionary = mydict)
overview <- convert(overview, to = "data.frame")

As you can see, the counts for "clouds" and "storms" end up under the "all_terms" key in the resulting data.frame. Is there a simple way to get the count of every term in "mydict" in its own column, without writing code for each term individually?

E.g.
clouds, storms
1, 1

Rather than 
all_terms
2

You can combine tidytext's unnest_tokens() function with tidyr's pivot_wider() to get the count of each word in a separate column:

library(dplyr)
library(tidytext)
library(tidyr)

txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."

mydict <- c("clouds","storms")

df <- data.frame(text = txt) %>% 
  unnest_tokens(word, text) %>%
  count(word) %>% 
  pivot_wider(names_from = word, values_from = n)

df %>% select(all_of(mydict))

# A tibble: 1 x 2
  clouds storms
   <int>  <int>
1      1      1
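One caveat with this approach: select() will error if a dictionary term never occurs in the text, since pivot_wider() only creates columns for words that were actually counted. A minimal sketch of a workaround, padding missing terms with zero-count columns first (the term "lightning" below is a made-up example that does not occur in the text):

```r
library(dplyr)
library(tidytext)
library(tidyr)

txt <- "gathering clouds and raging storms"
mydict <- c("clouds", "storms", "lightning")  # "lightning" never occurs

df <- data.frame(text = txt) %>%
  unnest_tokens(word, text) %>%
  count(word) %>%
  pivot_wider(names_from = word, values_from = n)

# add a zero column for any dictionary term missing from the text
missing <- setdiff(mydict, names(df))
df[missing] <- 0L

df %>% select(all_of(mydict))
```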

You want to use the dictionary values as the pattern in tokens_select(), rather than using them in a lookup function, which is what dfm(x, dictionary = ...) does. Here is how:

library("quanteda")
## Package version: 2.1.2

txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."

mydict <- dictionary(list(all_terms = c("clouds", "storms")))

This creates a dfm in which each column is a term, not a dictionary key:

dfmat <- tokens(txt) %>%
  tokens_select(mydict) %>%
  dfm()

dfmat
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
##        features
## docs    clouds storms
##   text1      1      1

You can convert this into a data.frame of counts in two ways:

convert(dfmat, to = "data.frame")
##   doc_id clouds storms
## 1  text1      1      1

textstat_frequency(dfmat)
##   feature frequency rank docfreq group
## 1  clouds         1    1       1   all
## 2  storms         1    1       1   all

While a dictionary is valid input for pattern (see ?pattern), you could also just feed a character vector of its values to tokens_select():

# no need for dictionary
tokens(txt) %>%
  tokens_select(c("clouds", "storms")) %>%
  dfm()
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
##        features
## docs    clouds storms
##   text1      1      1
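The same pipeline scales to multiple documents with no changes; each document becomes a row of the resulting data.frame. A small sketch with two made-up documents (the text and document names are assumptions for illustration):

```r
library(quanteda)

# two hypothetical documents, named so the rows are identifiable
txts <- c(doc1 = "gathering clouds and raging storms",
          doc2 = "clouds, clouds, and more clouds")

# same approach as above: select the terms of interest, then tabulate
dfmat <- tokens(txts) %>%
  tokens_select(c("clouds", "storms")) %>%
  dfm()

convert(dfmat, to = "data.frame")
```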