R: partial match dictionary terms using grep and tm package

Question

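Hi: I have a dictionary of negative terms that was compiled by someone else. I'm not sure how they did the stemming, but it looks like they used something other than the Porter stemmer. The dictionary uses a wildcard character (*) that I assume is meant to stand in for stemming. I don't know how to use it with grep() or the tm package in R, so I stripped it out, hoping to find a way to grep for partial matches. So the original dictionary looks like this: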

#load libraries
library(tm)
#sample dictionary terms for polarize and outlaw
negative<-c('polariz*', 'outlaw*')
#strip out wildcard (fixed=TRUE treats * as a literal character, not a regex quantifier)
negative<-gsub('*', '', negative, fixed=TRUE)
#test corpus
test<-c('polarize', 'polarizing', 'polarized', 'polarizes', 'outlaw', 'outlawed', 'outlaws')
#Here is how R's porter stemmer stems the text
stemDocument(test)

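So, if I stem my corpus with R's stemmer, a term like 'outlaw' will be found in the dictionary, but a term like 'polarized' will not match, because it is stemmed to something different from the dictionary entry.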

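So what I would like is some way of having the tm package match only the exact part of each word. That is, without stemming my documents, I would like it to pick out 'outlaw' in the terms 'outlawing' and 'outlaws', and to pick out 'polariz' in 'polarized', 'polarizing' and 'polarizes'. Is this possible?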

#Define corpus
test.corp<-Corpus(VectorSource(test))  
#make Document Term Matrix
dtm<-DocumentTermMatrix(test.corp, control=list(dictionary=negative))
#inspect
inspect(dtm)

Answer 1

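I haven't seen any tm answers, so here is one that uses the quanteda package as an alternative. It lets you use "glob" wildcard values in your dictionary entries, which is the default valuetype for quanteda's dictionary functions. (See ?dictionary.) With this approach, you don't need to stem your text.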

library(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.6.2’

# create a quanteda dictionary, essentially a named list
negative <- dictionary(list(polariz = 'polariz*', outlaw = 'outlaw*'))
negative
## Dictionary object with 2 key entries.
##  - polariz: polariz*
##  - outlaw: outlaw*

test <- c('polarize', 'polarizing', 'polarized', 'polarizes', 'outlaw', 'outlawed', 'outlaws')

dfm(test, dictionary = negative, valuetype = "glob", verbose = FALSE)
## Document-feature matrix of: 7 documents, 2 features.
## 7 x 2 sparse Matrix of class "dfmSparse"
##        features
## docs    polariz outlaw
##   text1       1      0
##   text2       1      0
##   text3       1      0
##   text4       1      0
##   text5       0      1
##   text6       0      1
##   text7       0      1
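
If you would rather stay in base R and use grep-style matching directly, the same glob idea can be sketched without any extra package. This is only a minimal illustration, not part of the quanteda approach above: it assumes each dictionary entry is a simple trailing-wildcard glob and that the text has already been split into single words. The names patterns and hits are made up for this example; glob2rx() comes from the utils package that ships with R.

```r
# Dictionary entries with their wildcards left in place, as in the question
negative <- c("polariz*", "outlaw*")
test <- c("polarize", "polarizing", "polarized", "polarizes",
          "outlaw", "outlawed", "outlaws")

# glob2rx() converts each glob into an anchored regular expression,
# so 'polariz*' only matches words that start with 'polariz'
patterns <- glob2rx(negative)
names(patterns) <- sub("*", "", negative, fixed = TRUE)

# One logical column per dictionary entry: TRUE where the word matches
hits <- sapply(patterns, grepl, x = test)
rownames(hits) <- test

# Number of matching words per dictionary entry
colSums(hits)
```

Because the patterns are anchored at the start of the word, an entry such as 'outlaw*' will not fire on a word that merely contains 'outlaw' somewhere in the middle, which is usually what a stem-style dictionary intends.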
