Removing Punctuation, Numbers, and Whitespace not working

I am trying to remove punctuation, numbers, and whitespace from a corpus.

My code is:

# Create a corpus
bd_corpus =  Corpus(VectorSource(bd_text))

# Clean the corpus by removing puncuation, numbers, and white spaces
bd_clean <- tm_map(bd_corpus,removePunctuation)
bd_clean <- tm_map(bd_corpus,removeNumbers)
bd_clean <- tm_map(bd_corpus,removeStripwhitespace)

wordcloud(bd_clean)

#modify your word cloud
wordcloud(bd_clean, random.order = F, max.words = 25, scale = c(7, 0.5))

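The output is a word cloud, but it still contains colons, backslashes, periods, and so on, for example "here," and "hey," and "people."

Here is the console output as well: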

# Clean the corpus by removing puncuation, numbers, and white spaces
> bd_clean <- tm_map(bd_corpus,removePunctuation)

Warning message:
In tm_map.SimpleCorpus(bd_corpus, removePunctuation) :
  transformation drops documents
> bd_clean <- tm_map(bd_corpus,removeNumbers)

Warning message:
In tm_map.SimpleCorpus(bd_corpus, removeNumbers) :
  transformation drops documents
> bd_clean <- tm_map(bd_corpus,removeStripwhitespace)

Error in tm_map.SimpleCorpus(bd_corpus, removeStripwhitespace) : 
  object 'removeStripwhitespace' not found

From the comment above by @Gregor:

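Suppose I have x <- 1. Then I run these commands: y <- x + 1, y <- x + 2, y <- x + 3. What is y at the end? The correct answer is 4, because when we run y <- x + 3 it does not matter what y was before. You are doing the same thing: bd_clean <- tm_map(bd_corpus, removePunctuation) removes punctuation from bd_corpus. Your next line, bd_clean <- tm_map(bd_corpus, removeNumbers), removes numbers from bd_corpus and overwrites the version without punctuation. Instead, you need bd_clean <- tm_map(bd_clean, removeNumbers), so that each step builds on the work you have already done.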
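
As a minimal sketch of that chaining, assuming the tm package is installed: each step after the first takes bd_clean as its input rather than bd_corpus. The sample bd_text here is hypothetical and only stands in for the real data. The last step uses stripWhitespace, which is tm's whitespace transformation; the console error in the question shows that removeStripwhitespace is not a defined object.

```r
library(tm)

# Hypothetical sample text standing in for bd_text from the question
bd_text <- c("Hey, people!  Here are 3 words.",
             "More   text: 42 items, here.")

# Create a corpus
bd_corpus <- Corpus(VectorSource(bd_text))

# The first step reads from bd_corpus; every later step reads from
# bd_clean, so the earlier cleaning is kept instead of overwritten
bd_clean <- tm_map(bd_corpus, removePunctuation)
bd_clean <- tm_map(bd_clean, removeNumbers)
bd_clean <- tm_map(bd_clean, stripWhitespace)

# Pull the cleaned text back out as a character vector to inspect it
cleaned <- vapply(
  seq_len(length(bd_clean)),
  function(i) paste(as.character(bd_clean[[i]]), collapse = " "),
  character(1)
)
print(cleaned)
```

With the real data, bd_clean from the last step is what would be passed to wordcloud().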