Identify WHICH words in a document have been matched by dictionary lookup and how many times

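A quanteda question.

For each document in a corpus, I am trying to find out which words from a dictionary category contributed to that category's total count, and by how much.

In other words, I would like a matrix of the features in each dictionary category that were matched by the tokens_lookup and dfm_lookup functions, along with their frequency in each document. So not the aggregate frequency of all words in a category, but the frequency of each individual word.

Is there an easy way to get this?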

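The simplest way is to loop over your dictionary "keys" (what you call "categories") and select the matches, creating one dfm per key. A few extra steps are needed to handle non-matches and compound dictionary values (such as "not fail").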
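
As a minimal sketch of why those extra steps matter, the toy example below pads and then compounds a single sentence. The sentence and the one-key dictionary are made up for illustration and are not taken from LSD2015.

```r
library("quanteda")

# made-up sentence and a hypothetical one-key dictionary with a multi-word value
toks <- tokens("We will not fail, and we will fight.")
dict <- dictionary(list(neg_negative = "not fail"))

# keep only the matched tokens, leaving "" pads where the other tokens were
sel <- tokens_select(toks, dict, pad = TRUE)

# join the matched two-token sequence into the single token "not_fail"
comp <- tokens_compound(sel, dict)

# inspect the result: pads show up as empty strings
as.list(comp)
```

Padding keeps every token position, so the pads can later be counted as non-matches, and compounding makes a multi-word value show up as one feature.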

I can demonstrate this using the built-in inaugural address corpus and the LSD2015 dictionary, which has four keys and includes multi-word values.
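
For orientation, the sketch below lists the dictionary keys and runs a plain tokens_lookup(), which reports only one total per key and document. The object name key_totals is an illustrative choice.

```r
library("quanteda")

# the keys of the built-in Lexicoder Sentiment Dictionary
names(data_dictionary_LSD2015)

# a plain lookup replaces each match with its key, so the resulting dfm
# holds one total per key and does not show which words were matched
toks <- tokens(tail(data_corpus_inaugural, 3))
key_totals <- dfm(tokens_lookup(toks, data_dictionary_LSD2015))
key_totals
```

These per-key totals are what the question wants to break down into individual words.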

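Loop over the dictionary keys to build a list, each time doing the following:

  • select the tokens, but leave a pad for the tokens that were not selected;
  • compound the multi-word tokens into single tokens;
  • rename the pad ("") to OTHER, so that we can count the non-matches; and
  • create the dfm.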
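
library("quanteda")
## Package version: 2.1.0
library("magrittr")  # provides %>% in case the installed quanteda does not re-export it

# tokenize the last three inaugural addresses in the corpus
toks <- tokens(tail(data_corpus_inaugural, 3))

# build one dfm per dictionary key, stored under the key's name
dfm_list <- list()
for (key in names(data_dictionary_LSD2015)) {
  dfm_list[[key]] <- tokens_select(toks, data_dictionary_LSD2015[key], pad = TRUE) %>%  # keep matches, pad the rest
    tokens_compound(data_dictionary_LSD2015[key]) %>%  # join multi-word matches into one token
    tokens_replace("", "OTHER") %>%                    # label the pads so non-matches are counted
    dfm(tolower = FALSE)
}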

Now we have all of the dictionary matches for each key, in a list of dfm objects:

dfm_list
## $negative
## Document-feature matrix of: 3 documents, 180 features (60.0% sparse) and 4 docvars.
##             features
## docs         clouds raging storms crisis war against violence hatred badly
##   2009-Obama      1      1      2      4   2       1        1      1     1
##   2013-Obama      0      1      1      1   3       1        0      0     0
##   2017-Trump      0      0      0      0   0       1        0      0     0
##             features
## docs         weakened
##   2009-Obama        1
##   2013-Obama        0
##   2017-Trump        0
## [ reached max_nfeat ... 170 more features ]
## 
## $positive
## Document-feature matrix of: 3 documents, 256 features (53.0% sparse) and 4 docvars.
##             features
## docs         grateful trust mindful thank well generosity cooperation
##   2009-Obama        1     2       1     1    2          1           2
##   2013-Obama        0     0       0     0    4          0           0
##   2017-Trump        1     0       0     1    0          0           0
##             features
## docs         prosperity peace skill
##   2009-Obama          3     4     1
##   2013-Obama          1     3     1
##   2017-Trump          1     0     0
## [ reached max_nfeat ... 246 more features ]
## 
## $neg_positive
## Document-feature matrix of: 3 documents, 2 features (33.3% sparse) and 4 docvars.
##             features
## docs         not_apologize OTHER
##   2009-Obama             1  2687
##   2013-Obama             0  2317
##   2017-Trump             0  1660
## 
## $neg_negative
## Document-feature matrix of: 3 documents, 5 features (53.3% sparse) and 4 docvars.
##             features
## docs         not_fight not_sap not_grudgingly not_fail OTHER
##   2009-Obama         0       0              1        0  2687
##   2013-Obama         1       1              0        0  2313
##   2017-Trump         0       0              0        1  1658
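
To go from one of these dfm objects to a per-document table of word counts, something along these lines should work. The small dfm built here is a made-up stand-in for an element such as dfm_list[["negative"]], and the names neg, matched and long are illustrative.

```r
library("quanteda")

# made-up stand-in for one element of the list, built the same way
toks <- tokens(c(d1 = "war crisis war", d2 = "crisis peace"))
sel <- tokens_select(toks, c("war", "crisis"), pad = TRUE)
neg <- dfm(tokens_replace(sel, "", "OTHER"), tolower = FALSE)

# drop the non-match column to keep only the matched words
matched <- dfm_remove(neg, "OTHER")

# matched words ranked by total frequency across documents
topfeatures(matched)

# long format: one row per document and matched word
# (as.matrix() is column-major, so documents cycle fastest)
long <- data.frame(
  doc_id = rep(docnames(matched), times = nfeat(matched)),
  word   = rep(featnames(matched), each = ndoc(matched)),
  count  = as.vector(as.matrix(matched))
)
long[long$count > 0, ]
```

Filtering on count > 0 keeps the table short, since most dictionary words do not occur in any given document.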