Package tm: How to avoid removing stopwords

Question

I want to keep stopwords in my terms, but no matter how I set tm's parameters, it always seems to remove some of them.

library(tm)
documents <- c("This is a list containing the tallest buildings in San Francisco")
corpus <- Corpus(VectorSource(documents))
matrix <- DocumentTermMatrix(corpus,control=list(stopwords=FALSE))
colnames(matrix)
# [1] "buildings"  "containing" "francisco"  "list"       "san"       
# [6] "tallest"    "the"        "this"

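DocumentTermMatrix seems to remove the stopwords "is" and "in".

How can I avoid this? Setting stopwords=FALSE only keeps "the" from being removed. How can I prevent "is" and "in" from being removed as well?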

Answer 1

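The problem is not that DocumentTermMatrix treats "is" and "in" as stopwords. They are dropped because they are shorter than 3 characters. By default, only strings of length 3 to infinity are counted as words (wordLengths=c(3, Inf)), so anything shorter is excluded.

You can modify your control list as follows to include words of one or more letters: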

matrix <- DocumentTermMatrix(corpus,control=list(stopwords=FALSE,
                                                 wordLengths=c(1, Inf)))

I believe this is what you want:

> colnames(matrix)
 [1] "a"          "buildings"  "containing" "francisco"  "in"         "is"        
 [7] "list"       "san"        "tallest"    "the"        "this"

You did not mention "a" in your question, so if you want to exclude it (and other one-letter words such as "I"), set wordLengths to start from 2.
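
As a minimal sketch of the two-character variant, the call below reuses the corpus from the question and only raises the lower bound of wordLengths to 2. The variable names dtm and terms are chosen here for illustration (they also avoid shadowing base R's matrix function), and the exact ordering of the returned terms may depend on the tm version.

```r
library(tm)

documents <- c("This is a list containing the tallest buildings in San Francisco")
corpus <- Corpus(VectorSource(documents))

# Keep stopwords, but require at least 2 characters per term,
# so "is" and "in" stay while the one-letter word "a" is dropped.
dtm <- DocumentTermMatrix(corpus, control = list(stopwords = FALSE,
                                                 wordLengths = c(2, Inf)))
terms <- colnames(dtm)
print(terms)
```

With this setting, "is", "in", "the" and "this" should all appear among the terms, while "a" should not.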

r

stop-words

tm