Removing stopwords from a user-defined corpus in R
I have a set of documents:
documents = c("She had toast for breakfast",
"The coffee this morning was excellent",
"For lunch let's all have pancakes",
"Later in the day, there will be more talks",
"The talks on the first day were great",
"The second day should have good presentations too")
Within this set of documents, I would like to remove the stopwords. I have already removed the punctuation and converted to lower case, using:
documents = tolower(documents) #make it lower case
documents = gsub('[[:punct:]]', '', documents) #remove punctuation
First I convert to a corpus object:
documents <- Corpus(VectorSource(documents))
Then I try to remove the stopwords:
documents = tm_map(documents, removeWords, stopwords('english')) #remove stopwords
But this last line results in the following error:
THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC() to debug.
This has already been asked here, but no answer was given. What does this error mean?
EDIT
Yes, I am using the tm package.
Here is the output of sessionInfo():
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
Perhaps try using the tm_map function to transform the documents instead. It seems to work in my case:
> documents = c("She had toast for breakfast",
+ "The coffee this morning was excellent",
+ "For lunch let's all have pancakes",
+ "Later in the day, there will be more talks",
+ "The talks on the first day were great",
+ "The second day should have good presentations too")
> library(tm)
Loading required package: NLP
> documents <- Corpus(VectorSource(documents))
> documents = tm_map(documents, content_transformer(tolower))
> documents = tm_map(documents, removePunctuation)
> documents = tm_map(documents, removeWords, stopwords("english"))
> documents
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 6
This produces:
> documents[[1]]$content
[1] " toast breakfast"
> documents[[2]]$content
[1] " coffee morning excellent"
> documents[[3]]$content
[1] " lunch lets pancakes"
> documents[[4]]$content
[1] "later day will talks"
> documents[[5]]$content
[1] " talks first day great"
> documents[[6]]$content
[1] " second day good presentations "
When I run into problems with tm, I often end up just editing the original text instead. It is a bit awkward for removing words, but you can paste together a regular expression from tm's list of stopwords:
# Build one regex that matches any stopword between word boundaries.
# Note the doubled backslashes: '\\b' is needed so R passes a literal \b
# to the regex engine. stopwords('en') is lower case, so lower-case the
# text first (as done above).
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
documents = stringr::str_replace_all(documents, stopwords_regex, '')
> documents
[1] " toast breakfast" " coffee morning excellent"
[3] " lunch lets pancakes" "later day will talks"
[5] " talks first day great" " second day good presentations "
You can remove stopwords using the quanteda package, but first make sure that your words are tokens, and then use the following:
library(quanteda)
x <- tokens_select(x, stopwords(), selection = 'remove')
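In case it helps, here is a minimal end-to-end sketch with quanteda, assuming the documents character vector from the question; tokens() and tokens_select() are used as in quanteda's standard tokens API:

library(quanteda)
# Tokenise the raw documents, dropping punctuation along the way.
toks = tokens(tolower(documents), remove_punct = TRUE)
# Remove the English stopwords from the tokens object.
toks = tokens_select(toks, stopwords('english'), selection = 'remove')
toks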