Removing stopwords from a user-defined corpus in R


library(tm)

documents = c("She had toast for breakfast",
 "The coffee this morning was excellent", 
 "For lunch let's all have pancakes", 
 "Later in the day, there will be more talks", 
 "The talks on the first day were great", 
 "The second day should have good presentations too")

documents = tolower(documents) # convert to lower case
documents = gsub('[[:punct:]]', '', documents) # remove punctuation

documents <- Corpus(VectorSource(documents)) # build a tm corpus from the character vector

documents = tm_map(documents, removeWords, stopwords('english')) # remove stopwords



This question has already been asked here, but no answer was given. What does this error mean?


Yes, I am using the tm package.

Here is the output of sessionInfo():

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)


> library(tm)
Loading required package: NLP
> documents <- Corpus(VectorSource(documents))
> documents = tm_map(documents, content_transformer(tolower))
> documents = tm_map(documents, removePunctuation)
> documents = tm_map(documents, removeWords, stopwords("english"))
> documents
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 6


> documents[[1]]$content
[1] "  toast  breakfast"
> documents[[2]]$content
[1] " coffee  morning  excellent"
> documents[[3]]$content
[1] " lunch lets   pancakes"
> documents[[4]]$content
[1] "later   day  will   talks"
> documents[[5]]$content
[1] " talks   first day  great"
> documents[[6]]$content
[1] " second day   good presentations "

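When I run into problems with tm, I often just end up editing the original text.

Removing words this way is a bit awkward, but you can paste tm's stopword list together into a single regular expression.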

# '\\b' is a regex word boundary; a single '\b' in an R string is a backspace character
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
documents = stringr::str_replace_all(documents, stopwords_regex, '')

> documents
[1] "     toast  breakfast"             " coffee  morning  excellent"      
[3] " lunch lets   pancakes"            "later   day  will   talks"        
[5] " talks   first day  great"         " second day   good presentations "
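
For anyone who wants to try the regex idea without loading tm, here is a minimal base R sketch. The short `my_stopwords` vector is a hypothetical stand-in for `stopwords('en')`, and the whitespace cleanup at the end is an extra step that is not part of the answer above.

```r
# Hypothetical, hand-picked stopword list standing in for tm's stopwords('en')
my_stopwords <- c("she", "had", "for", "the", "this", "was")

docs <- c("she had toast for breakfast",
          "the coffee this morning was excellent")

# One alternation wrapped in word boundaries: \b(she|had|...)\b
# The backslash is doubled because it sits inside an R string literal.
pattern <- paste0("\\b(", paste(my_stopwords, collapse = "|"), ")\\b")

cleaned <- gsub(pattern, "", docs, perl = TRUE)

# Collapse the runs of spaces left behind, then trim both ends
cleaned <- gsub("\\s+", " ", cleaned)
cleaned <- gsub("^\\s+|\\s+$", "", cleaned)

cleaned
# [1] "toast breakfast"          "coffee morning excellent"
```

Wrapping the alternation in a group with one `\\b` on each side should match the same words as gluing `\\b|\\b` between every entry, and it is a little easier to read.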

You can remove stopwords with the quanteda package, but first make sure your text has been tokenized, and then use the following:

x <- tokens_select(x, stopwords(), selection = "remove")
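
As a rough end-to-end sketch of the quanteda route, assuming a reasonably recent quanteda version in which `tokens()`, `tokens_tolower()` and the re-exported `stopwords()` are available; the two sample sentences are taken from the question, and the variable names `docs` and `toks` are only illustrative:

```r
library(quanteda)

docs <- c("She had toast for breakfast",
          "The coffee this morning was excellent")

# Tokenize first: tokens_select() works on tokens objects, not raw strings
toks <- tokens(docs, remove_punct = TRUE)
toks <- tokens_tolower(toks)

# Drop every token that matches an entry in the English stopword list
toks <- tokens_select(toks, pattern = stopwords("en"), selection = "remove")

# Inspect the remaining tokens, one character vector per document
as.list(toks)
```

`tokens_remove(toks, stopwords("en"))` should be an equivalent shorthand for the `selection = "remove"` call.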