removeWords not working
I am trying to build a word cloud of the Jeopardy dataset found here: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
My code is as follows:
library(tm)
library(SnowballC)
library(wordcloud)
jeopQ <- read.csv('JEOPARDY_CSV.csv', stringsAsFactors = FALSE)
jeopCorpus <- Corpus(VectorSource(jeopQ$Question))
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, removeWords, c('the', 'this', stopwords('english')))
jeopCorpus <- tm_map(jeopCorpus, stemDocument)
wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)
The words 'the' and 'this' still appear in the word cloud. Why is this happening, and how can I fix it?
The arguments do not seem to be constructed correctly. Try:
tm_map(jeopCorpus, removeWords, c(stopwords("english"),"the","this"))
But as noted, these words are already included in the stop word list, so simply
tm_map(jeopCorpus, removeWords, stopwords("english"))
should work.
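To check the claim that both words are already covered, a quick sketch (assuming the tm package is installed) tests membership in the English stop word list:

```r
library(tm)

# Are the two words from the question already in tm's English stop word list?
extra_words <- c("the", "this")
already_included <- extra_words %in% stopwords("english")

print(already_included)
```

If both values are TRUE, adding the words by hand to the removeWords call changes nothing.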
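The problem is that you are not converting the text to lowercase. Many questions start with "The". The stop words are all lowercase, for example "the" and "this". Because "The" != "the", "The" is not removed from the corpus.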
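The case sensitivity can be seen on a single string, without building a corpus. This is a minimal sketch, assuming the tm package is installed; the sample sentence is made up for illustration and is not taken from the dataset. It relies on removeWords also accepting a plain character vector.

```r
library(tm)

# Hypothetical sample text, not taken from the Jeopardy dataset
q <- "The capital of this country is on the Seine"

# removeWords matches case-sensitively, so the leading "The" survives
kept_upper <- removeWords(q, stopwords("english"))

# Lowercasing first lets every stop word match
all_lower <- removeWords(tolower(q), stopwords("english"))

print(kept_upper)
print(all_lower)
```

The first result still contains "The", while the lowercased version has no stop words left. Removing words leaves extra spaces behind, which stripWhitespace can clean up if needed.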
If you use the code below, it should work:
jeopCorpus <- tm_map(jeopCorpus, content_transformer(tolower))
jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, stemDocument)
wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)