How do I convert this corpus of words from an online book into a term document matrix?

Question

Here is my code snippet:

library(gutenbergr)
library(tm)
Alice <- gutenberg_download(c(11))
Alice <- Corpus(VectorSource(Alice))
cleanAlice <- tm_map(Alice, removeWords, stopwords('english'))
cleanAlice <- tm_map(cleanAlice, removeWords, c('Alice'))
cleanAlice <- tm_map(cleanAlice, tolower)
cleanAlice <- tm_map(cleanAlice, removePunctuation)
cleanAlice <- tm_map(cleanAlice, stripWhitespace)
dtm1 <- TermDocumentMatrix(cleanAlice)
dtm1

But then I get the following error:

<<TermDocumentMatrix (terms: 3271, documents: 2)>>
Non-/sparse entries: 3271/3271
Sparsity           : 50%
Error in nchar(Terms(x), type = "chars") : 
  invalid multibyte string, element 12

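How should I handle this? Should I convert the corpus to plain text documents first? Or is there something wrong with the text format of the book?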

Answer 1

gutenbergr returns a data.frame, not a text vector. You only need to adjust your code slightly for it to work: instead of VectorSource(Alice), use VectorSource(Alice$text).
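The difference can be seen without downloading anything. The sketch below uses a small hand-made data frame, `book`, as a hypothetical stand-in for what gutenberg_download() returns (the real result has the same two columns, gutenberg_id and text). Because a data.frame is a list of columns, VectorSource() sees one element per column, which would explain the "documents: 2" in the question's output.

```r
library(tm)

# Hypothetical stand-in for the result of gutenberg_download():
# one row per line of the book, with an id column and a text column.
book <- data.frame(
  gutenberg_id = c(11L, 11L, 11L),
  text = c("Alice was beginning to get very tired",
           "of sitting by her sister on the bank",
           "and of having nothing to do"),
  stringsAsFactors = FALSE
)

# Passing the whole data.frame: each COLUMN is treated as a document.
corpus_wrong <- Corpus(VectorSource(book))

# Passing the text column: each LINE of text is treated as a document.
corpus_right <- Corpus(VectorSource(book$text))

length(corpus_wrong)  # number of columns in book
length(corpus_right)  # number of rows in book
```

With the real download, the second form gives one document per line of the book, which is why the corrected code reports thousands of documents rather than two.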

library(gutenbergr)
library(tm)

# don't overwrite your download when you are testing
Alice <- gutenberg_download(c(11))

# specify the column in the data.frame
Alice_corpus <- Corpus(VectorSource(Alice$text))
cleanAlice <- tm_map(Alice_corpus, removeWords, stopwords('english'))
cleanAlice <- tm_map(cleanAlice, removeWords, c('Alice'))
cleanAlice <- tm_map(cleanAlice, tolower)
cleanAlice <- tm_map(cleanAlice, removePunctuation)
cleanAlice <- tm_map(cleanAlice, stripWhitespace)
dtm1 <- TermDocumentMatrix(cleanAlice)
dtm1

<<TermDocumentMatrix (terms: 3293, documents: 3380)>>
Non-/sparse entries: 13649/11116691
Sparsity           : 100%
Maximal term length: 46
Weighting          : term frequency (tf)

P.S. You can ignore the warning messages produced by the code.
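One optional refinement, offered as a sketch rather than as part of the original answer: removeWords is case-sensitive and stopwords('english') is all lowercase, so capitalized stopwords such as "The" are not removed if the stopword step runs before tolower. The variant below lowercases first and then removes the stopwords together with "alice". It uses VCorpus with content_transformer(tolower), and the two sample sentences are made up for illustration.

```r
library(tm)

# Made-up sample lines standing in for Alice$text
lines <- c("The Rabbit said to Alice",
           "Alice looked at the Rabbit")

corpus <- VCorpus(VectorSource(lines))

# Lowercase first so that "The" and "Alice" match the lowercase word lists
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "alice"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

tdm <- TermDocumentMatrix(corpus)
Terms(tdm)
```

Applied to the full book, this ordering will produce term counts that differ from the output shown above, since more stopwords are removed.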

r

matrix

text-mining