Count words in texts that are NOT in a given dictionary

Question

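How can I find and count the words in a text that are NOT in a given dictionary?

The example below counts every occurrence of specific dictionary words (clouds and storms) in the text.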

library("quanteda")
txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."   
mydict <- dictionary(list(all_terms = c("clouds", "storms")))
dfmat <- tokens(txt) %>%
  tokens_select(mydict) %>%
  dfm()
dfmat

Output:

docs    clouds storms
  text1      1      1

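How can I instead produce a count of all the words that are NOT in the dictionary (clouds/storms)? Ideally with stopwords excluded.

For example, the desired output: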

docs    Forty-four Americans ...
  text1      1      1

Answer 1

This is a case for the setdiff() function. Here is an example, based on yours, of how to extract the words used by Obama (in $2013-Obama) that were not used by Biden (in $2021-Biden):

diff <- setdiff(toks[[1]], toks[[3]])
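
The same idea can be applied to the dictionary case from the question. The following is a minimal sketch, not part of the original answer: it assumes a shortened one-sentence version of the question's text, and the names dict_words, all_words, not_in_dict and counts are made up for illustration. Note that setdiff() returns unique values only, so the counts are computed separately with table().

```r
library("quanteda")

# Shortened sample text (assumption: one sentence taken from the question's txt)
txt <- "Yet, every so often the oath is taken amidst gathering clouds and raging storms."
dict_words <- c("clouds", "storms")

# Tokenize, drop punctuation, and lowercase so that matching ignores case
toks <- tokens_tolower(tokens(txt, remove_punct = TRUE))

# as.character() flattens the tokens object into a plain character vector
all_words <- as.character(toks)

# Unique words in the text that are not dictionary words
not_in_dict <- setdiff(all_words, dict_words)

# setdiff() discards duplicates, so count occurrences separately
counts <- table(all_words[!all_words %in% dict_words])
counts
```

This does not remove stopwords; they could be dropped from all_words in the same way before counting.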

Answer 2

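When you look at the help file for tokens_select (run ?tokens_select), you can see that the third argument is selection. The default value is "keep", but what you want is "remove". Since this is a common thing to do, there is also a dedicated tokens_remove command, which I use below to remove the stopwords.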

dfmat <- tokens(txt) %>%
  tokens_select(mydict, selection = "remove") %>%
  tokens_remove(stopwords::stopwords(language = "en")) %>% 
  dfm()
dfmat
#> Document-feature matrix of: 1 document, 38 features (0.00% sparse) and 0 docvars.
#>        features
#> docs    forty-four americans now taken presidential oath . words spoken rising
#>   text1          1         1   1     2            1    2 4     1      1      1
#> [ reached max_nfeat ... 28 more features ]

I think this is what you were trying to do.

Created on 2021-12-28 by the reprex package (v2.0.1)
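
As a possible variation on the approach above, the removal can also be done after the document-feature matrix has been built, using dfm_remove(). This is only a sketch under assumptions: it uses a shortened one-sentence version of the question's text, the names dfmat_all and dfmat_rest are made up for illustration, and it passes remove_punct = TRUE to tokens() so that punctuation such as "." does not show up as a feature.

```r
library("quanteda")

# Shortened sample text (assumption: one sentence taken from the question's txt)
txt <- "Yet, every so often the oath is taken amidst gathering clouds and raging storms."
mydict <- dictionary(list(all_terms = c("clouds", "storms")))

# Build the full dfm first; dfm() lowercases features by default
dfmat_all <- dfm(tokens(txt, remove_punct = TRUE))

# Drop the dictionary words, then the English stopwords, at the dfm level
dfmat_rest <- dfm_remove(dfmat_all, pattern = mydict)
dfmat_rest <- dfm_remove(dfmat_rest, pattern = stopwords("en"))

# Named vector of counts for the remaining features
featfreq(dfmat_rest)
```

Working at the tokens level, as in the answer above, is usually preferable when later steps such as n-grams depend on token order; the dfm-level version is handy when the matrix already exists.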


Tags: nlp, r, word-count, quanteda