Identify WHICH words in a document have been matched by dictionary lookup and how many times
A quanteda question.
For each document in a corpus, I am trying to find out which words in a dictionary category contribute to the total count for that category, and by how much.
In other words, I want a document-feature matrix of the features matched by the tokens_lookup and dfm_lookup functions within each dictionary category, along with how often each occurs in each document. So not the total frequency of all words in a category, but the frequency of each individual matched word.
Is there a simple way to get this?
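For context, this is what the standard lookup produces: dfm_lookup collapses all matches into a single count per dictionary key, so the individual matched words are not visible. A minimal sketch using the built-in inaugural corpus and LSD2015 dictionary, assuming quanteda is installed:

```r
library("quanteda")

toks <- tokens(tail(data_corpus_inaugural, 3))

# One column per dictionary key; the matched words themselves are lost
dfm_lookup(dfm(toks), data_dictionary_LSD2015)
```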
The easiest way is to loop over your dictionary "keys" (what you are calling "categories") and select the matches, creating a dfm for each key. A few extra steps are needed to handle non-matches and compound dictionary values (such as "not fail").
I can demonstrate this using the built-in inaugural address corpus and the LSD2015 dictionary, which has four keys and includes multi-word values.
Loop over the dictionary keys to build up a list, each time doing the following:
- select the tokens, but leave a pad for the tokens not selected;
- compound the multi-word tokens into single tokens;
- rename the pad ("") to OTHER, so that we can count the non-matches; and
- create the dfm.
library("quanteda")
## Package version: 2.1.0
toks <- tokens(tail(data_corpus_inaugural, 3))
dfm_list <- list()
for (key in names(data_dictionary_LSD2015)) {
  this_dfm <- tokens_select(toks, data_dictionary_LSD2015[key], pad = TRUE) %>%
    tokens_compound(data_dictionary_LSD2015[key]) %>%
    tokens_replace("", "OTHER") %>%
    dfm(tolower = FALSE)
  dfm_list <- c(dfm_list, this_dfm)
}
names(dfm_list) <- names(data_dictionary_LSD2015)
Now we have all of the dictionary matches for each key in a list of dfm objects:
dfm_list
## $negative
## Document-feature matrix of: 3 documents, 180 features (60.0% sparse) and 4 docvars.
## features
## docs clouds raging storms crisis war against violence hatred badly
## 2009-Obama 1 1 2 4 2 1 1 1 1
## 2013-Obama 0 1 1 1 3 1 0 0 0
## 2017-Trump 0 0 0 0 0 1 0 0 0
## features
## docs weakened
## 2009-Obama 1
## 2013-Obama 0
## 2017-Trump 0
## [ reached max_nfeat ... 170 more features ]
##
## $positive
## Document-feature matrix of: 3 documents, 256 features (53.0% sparse) and 4 docvars.
## features
## docs grateful trust mindful thank well generosity cooperation
## 2009-Obama 1 2 1 1 2 1 2
## 2013-Obama 0 0 0 0 4 0 0
## 2017-Trump 1 0 0 1 0 0 0
## features
## docs prosperity peace skill
## 2009-Obama 3 4 1
## 2013-Obama 1 3 1
## 2017-Trump 1 0 0
## [ reached max_nfeat ... 246 more features ]
##
## $neg_positive
## Document-feature matrix of: 3 documents, 2 features (33.3% sparse) and 4 docvars.
## features
## docs not_apologize OTHER
## 2009-Obama 1 2687
## 2013-Obama 0 2317
## 2017-Trump 0 1660
##
## $neg_negative
## Document-feature matrix of: 3 documents, 5 features (53.3% sparse) and 4 docvars.
## features
## docs not_fight not_sap not_grudgingly not_fail OTHER
## 2009-Obama 0 0 1 0 2687
## 2013-Obama 1 1 0 0 2313
## 2017-Trump 0 0 0 1 1658
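If a tidy table of matched words per document is more convenient than a list of dfm objects, each element can be converted with quanteda's convert() function, dropping the OTHER (non-match) count first. A sketch, assuming the dfm_list built above:

```r
library("quanteda")

# Remove the non-match counts, then convert one key's matches to a data frame
# with one row per document and one column per matched word
negative_matches <- convert(dfm_remove(dfm_list$negative, "OTHER"),
                            to = "data.frame")
negative_matches[, 1:5]
```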