Removing stopwords from a user-defined corpus in R

I have a set of documents:

documents = c("She had toast for breakfast",
 "The coffee this morning was excellent", 
 "For lunch let's all have pancakes", 
 "Later in the day, there will be more talks", 
 "The talks on the first day were great", 
 "The second day should have good presentations too")

I want to remove the stopwords from this set of documents. I have already converted to lower case and removed the punctuation, using:

documents = tolower(documents) #make it lower case
documents = gsub('[[:punct:]]', '', documents) #remove punctuation

First I convert to a corpus object:

library(tm) #Corpus() and VectorSource() come from the tm package
documents <- Corpus(VectorSource(documents))

Then I try to remove the stopwords:

documents = tm_map(documents, removeWords, stopwords('english')) #remove stopwords

But this last line results in the following error:

THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC() to debug.

This has been asked here before, but no answer was given. What does this error mean?

EDIT

Yes, I am using the tm package.

Here is the output of sessionInfo():

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
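
A commonly suggested workaround for this particular error on OS X, offered here as an assumption rather than a confirmed fix, is to keep tm from forking: tm 0.5.x applied transformations through parallel::mclapply, and passing mc.cores = 1 forces single-core execution.

documents = tm_map(documents, removeWords, stopwords('english'), mc.cores = 1) #assumption: mc.cores is forwarded to mclapply, disabling the fork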

Maybe try using the tm_map function to transform the documents instead. It seems to work in my case.

> documents = c("She had toast for breakfast",
+  "The coffee this morning was excellent", 
+  "For lunch let's all have pancakes", 
+  "Later in the day, there will be more talks", 
+  "The talks on the first day were great", 
+  "The second day should have good presentations too")
> library(tm)
Loading required package: NLP
> documents <- Corpus(VectorSource(documents))
> documents = tm_map(documents, content_transformer(tolower))
> documents = tm_map(documents, removePunctuation)
> documents = tm_map(documents, removeWords, stopwords("english"))
> documents
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 6

This yields

> documents[[1]]$content
[1] "  toast  breakfast"
> documents[[2]]$content
[1] " coffee  morning  excellent"
> documents[[3]]$content
[1] " lunch lets   pancakes"
> documents[[4]]$content
[1] "later   day  will   talks"
> documents[[5]]$content
[1] " talks   first day  great"
> documents[[6]]$content
[1] " second day   good presentations "

When I run into problems with tm, I often end up just editing the original text.

It's a little awkward for deleting words, but you can paste together a regular expression from tm's list of stopwords.

# Note the doubled backslashes: '\b' in an R string is a backspace escape,
# while '\\b' produces the literal \b word-boundary regex
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
documents = stringr::str_replace_all(documents, stopwords_regex, '')

> documents
[1] "     toast  breakfast"             " coffee  morning  excellent"      
[3] " lunch lets   pancakes"            "later   day  will   talks"        
[5] " talks   first day  great"         " second day   good presentations "

You can use the quanteda package to remove stopwords, but first make sure your words are tokens, and then use the following:

library(quanteda)
x <- tokens(documents) #make sure the words are tokens first
x <- tokens_select(x, stopwords('english'), selection = 'remove')
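
For completeness, a short end-to-end sketch, assuming a recent quanteda release in which tokens() accepts remove_punct and tokens_remove() is the shorthand for tokens_select(..., selection = 'remove'):

library(quanteda)

toks <- tokens(tolower(documents), remove_punct = TRUE) #tokenise, lower-case, drop punctuation
toks <- tokens_remove(toks, stopwords('english'))       #strip English stopwords
cleaned <- sapply(toks, paste, collapse = ' ')          #paste each document's tokens back into text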