TermDocumentMatrix doing unrequested cleaning (e.g. removing punctuation)

Question

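From my reading of the documentation, the TermDocumentMatrix function of the tm package is not behaving as expected. It appears to process the terms in ways I did not ask for.

Here is an example: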

require(tm)
sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"
corpus <- Corpus(VectorSource(sentence))
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf), 
                                                 removePunctuation = FALSE))
rownames(tdm)

The output shows that the punctuation has been removed and that the expression "rising...what" has been split:

 [1] "a"         "about"     "am"        "and"       "astrology" "cap"       "capricorn" "does"      "i"         "me"        "moon"      "rising"    "say"       "sun"       "that"     
[16] "what"

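In a related question, the problem was a tokenizer that removed punctuation. However, I am using the default words tokenizer, which I believe does not do that: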

> sapply(corpus, words)
      [,1]           
 [1,] "Astrology:"   
 [2,] "I"            
 [3,] "am"           
 [4,] "a"            
 [5,] "Capricorn"    
 [6,] "Sun"          
 [7,] "Cap"          
 [8,] "moon"         
 [9,] "and"          
[10,] "cap"          
[11,] "rising...what"
[12,] "does"         
[13,] "that"         
[14,] "say"          
[15,] "about"        
[16,] "me?"

Is the observed behavior incorrect, or what am I misunderstanding?

Answer 1

You are getting a SimpleCorpus object, which was introduced in tm package version 0.7 and which, according to ?SimpleCorpus,

takes internally various shortcuts to boost performance and minimize memory pressure

class(corpus)
# [1] "SimpleCorpus" "Corpus"

Now, as help(TermDocumentMatrix) states:

Available local options are documented in termFreq and are internally delegated to a termFreq call. This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost Tokenizer (via Rcpp)...

So you are not using words as the tokenizer, which would indeed give you

words(sentence)
 [1] "Astrology:"    "I"             "am"            "a"             "Capricorn"     "Sun"           "Cap"          
 [8] "moon"          "and"           "cap"           "rising...what" "does"          "that"          "say"          
[15] "about"         "me?"
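
To illustrate why the two tokenization styles disagree, here is a minimal base R sketch. It does not call the Boost Tokenizer that tm uses for a SimpleCorpus; the two regular expressions are only assumed stand-ins for "split on whitespace" versus "split on whitespace and punctuation".

```r
sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# Whitespace-only splitting: punctuation stays attached to the tokens,
# so "rising...what" survives as a single token.
ws_tokens <- strsplit(sentence, "[[:space:]]+")[[1]]

# Splitting on whitespace AND punctuation (illustrative stand-in only):
# "rising...what" becomes two tokens and the trailing "?" disappears.
punct_tokens <- strsplit(sentence, "[[:space:][:punct:]]+")[[1]]
punct_tokens <- punct_tokens[nzchar(punct_tokens)]  # drop empty strings, if any

print(ws_tokens)
print(punct_tokens)
```

The first vector matches what words(sentence) returns above; the second resembles the terms reported by the SimpleCorpus-based TermDocumentMatrix once they are lower-cased and de-duplicated.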

As mentioned in the comments, you can explicitly make your corpus volatile (see ?VCorpus) to get full flexibility:

A volatile corpus is fully kept in memory and thus all changes only affect the corresponding R object

corpus <- VCorpus(VectorSource(sentence)) 
Terms(TermDocumentMatrix(corpus, control = list(tokenize = "words")))
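
Note that the termFreq defaults still apply with a VCorpus: terms are lower-cased and terms shorter than three characters are dropped. A minimal sketch, assuming tm 0.7 or later and assuming the goal is to keep the tokens exactly as words() returns them, spells those options out:

```r
library(tm)

sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# Use VCorpus (not Corpus/SimpleCorpus) so the control options are
# delegated to termFreq() instead of the fixed SimpleCorpus pipeline.
corpus <- VCorpus(VectorSource(sentence))

tdm <- TermDocumentMatrix(
  corpus,
  control = list(
    tokenize          = "words",     # whitespace-based tokenizer
    tolower           = FALSE,       # keep the original case
    removePunctuation = FALSE,       # keep ":" "..." "?" attached
    wordLengths       = c(1, Inf)    # keep short terms such as "I" and "a"
  )
)

terms <- Terms(tdm)
print(terms)
```

With these settings "rising...what" and "Astrology:" should appear as terms with their punctuation intact.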
