Removing Punctuation, Numbers, and Whitespace not working
I'm trying to remove punctuation, numbers, and whitespace from a corpus.
My code is:
# Create a corpus
bd_corpus = Corpus(VectorSource(bd_text))
# Clean the corpus by removing punctuation, numbers, and white spaces
bd_clean <- tm_map(bd_corpus,removePunctuation)
bd_clean <- tm_map(bd_corpus,removeNumbers)
bd_clean <- tm_map(bd_corpus,removeStripwhitespace)
wordcloud(bd_clean)
#modify your word cloud
wordcloud(bd_clean, random.order = F, max.words = 25, scale = c(7, 0.5))
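The output is a word cloud, but it still contains colons, backslashes, periods, and so on, for example "here," and "hey," and "people."
Here is the console output as well: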
# Clean the corpus by removing puncuation, numbers, and white spaces
> bd_clean <- tm_map(bd_corpus,removePunctuation)
Warning message:
In tm_map.SimpleCorpus(bd_corpus, removePunctuation) :
transformation drops documents
> bd_clean <- tm_map(bd_corpus,removeNumbers)
Warning message:
In tm_map.SimpleCorpus(bd_corpus, removeNumbers) :
transformation drops documents
> bd_clean <- tm_map(bd_corpus,removeStripwhitespace)
Error in tm_map.SimpleCorpus(bd_corpus, removeStripwhitespace) :
object 'removeStripwhitespace' not found
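From @Gregor's comment above:
Suppose I have x <- 1. Then I run these commands: y <- x + 1, y <- x + 2, y <- x + 3. What is y at the end? 4 is the correct answer, because when we run y <- x + 3, it doesn't matter what y was before. You are doing the same thing: bd_clean <- tm_map(bd_corpus, removePunctuation) removes punctuation from bd_corpus. Your next line, bd_clean <- tm_map(bd_corpus, removeNumbers), removes numbers from bd_corpus and overwrites the version without punctuation. Instead, you need bd_clean <- tm_map(bd_clean, removeNumbers), so that each step builds on the work you have already done.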
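To make the chaining concrete, here is a minimal sketch of the cleaning steps with each call taking the previous result as its input. Two things in it are assumptions rather than part of the question: the sample bd_text is made up for illustration, and VCorpus is used in place of Corpus only so that each document can be read back with as.character. The last step also swaps removeStripwhitespace, which the console reports as not found, for stripWhitespace, the whitespace transformation that tm provides.

```r
library(tm)

# Hypothetical sample text standing in for bd_text from the question
bd_text <- c("Hey, people: here are 3 things!", "Here,   it is 2024.")

# Assumption: VCorpus instead of Corpus, so documents can be read back below
bd_corpus <- VCorpus(VectorSource(bd_text))

# Each step takes the result of the previous step, not the original corpus
bd_clean <- tm_map(bd_corpus, removePunctuation)
bd_clean <- tm_map(bd_clean, removeNumbers)
bd_clean <- tm_map(bd_clean, stripWhitespace)  # tm's name for this step

# Pull the cleaned text back out as a character vector to inspect it
cleaned <- sapply(bd_clean, as.character)
print(cleaned)
```

The resulting bd_clean can then be passed to wordcloud() exactly as in the question. Note that stripWhitespace collapses runs of whitespace into a single space; it does not trim leading or trailing spaces.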