TermDocumentMatrix doing unrequested cleaning (e.g. removing punctuation)
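Based on my reading of the documentation, the TermDocumentMatrix function of the tm package is not working correctly. It seems to be processing the terms in ways I did not ask for.
Here is an example: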
require(tm)
sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"
corpus <- Corpus(VectorSource(sentence))
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf),
removePunctuation = FALSE))
rownames(tdm)
As the output shows, the punctuation has been removed and the expression "rising...what" has been split:
[1] "a" "about" "am" "and" "astrology" "cap" "capricorn" "does" "i" "me" "moon" "rising" "say" "sun" "that"
[16] "what"
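In a related question, the problem turned out to be a tokenizer that removed punctuation. However, I am using the default words tokenizer, which I don't think does this: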
> sapply(corpus, words)
[,1]
[1,] "Astrology:"
[2,] "I"
[3,] "am"
[4,] "a"
[5,] "Capricorn"
[6,] "Sun"
[7,] "Cap"
[8,] "moon"
[9,] "and"
[10,] "cap"
[11,] "rising...what"
[12,] "does"
[13,] "that"
[14,] "say"
[15,] "about"
[16,] "me?"
Is the observed behaviour incorrect, or what am I misunderstanding?
You are getting a SimpleCorpus object, which was introduced in tm package version 0.7 and which, according to ?SimpleCorpus,
takes internally various shortcuts to boost performance and minimize
memory pressure
class(corpus)
# [1] "SimpleCorpus" "Corpus"
Now, as help(TermDocumentMatrix) states:
Available local options are documented in termFreq and are internally
delegated to a termFreq call. This is different for a SimpleCorpus. In
this case all options are processed in a fixed order in one pass to
improve performance. It always uses the Boost Tokenizer (via Rcpp)...
So you are not actually using words as the tokenizer, which would indeed give you
words(sentence)
[1] "Astrology:" "I" "am" "a" "Capricorn" "Sun" "Cap"
[8] "moon" "and" "cap" "rising...what" "does" "that" "say"
[15] "about" "me?"
As mentioned in the comments, you can explicitly make your corpus volatile (see ?VCorpus) to get full flexibility:
A volatile corpus is fully kept in memory and thus all changes only
affect the corresponding R object
corpus <- VCorpus(VectorSource(sentence))
Terms(TermDocumentMatrix(corpus, control = list(tokenize = "words")))
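To see the difference side by side, here is a minimal sketch that builds both corpus types from the same sentence. It assumes tm version 0.7 or later (where Corpus() on a VectorSource returns a SimpleCorpus); the variable names simple, volatile, terms_simple and terms_volatile are just illustrative. I also set tolower = FALSE and wordLengths = c(1, Inf) for the VCorpus so that the terms come out exactly as the words tokenizer produces them, since termFreq would otherwise lower-case them and drop terms shorter than three characters.

```r
library(tm)

sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# Corpus() on a VectorSource yields a SimpleCorpus (assumes tm >= 0.7),
# which always tokenizes with the Boost tokenizer
simple <- Corpus(VectorSource(sentence))
terms_simple <- Terms(TermDocumentMatrix(simple,
                      control = list(wordLengths = c(1, Inf))))

# VCorpus() delegates the control options to termFreq,
# so the requested tokenizer is actually used
volatile <- VCorpus(VectorSource(sentence))
terms_volatile <- Terms(TermDocumentMatrix(volatile,
                        control = list(tokenize = "words",
                                       tolower = FALSE,
                                       wordLengths = c(1, Inf))))

# Punctuation survives only in the VCorpus result
"rising...what" %in% terms_simple
"rising...what" %in% terms_volatile
```

The last two lines should print FALSE and TRUE respectively, which matches the behaviour described in the question. If you need the SimpleCorpus speed-ups, the trade-off is that you lose control over tokenization.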