R: Converting Tibbles to a Term Document Matrix
I am using the R programming language. I have learned how to download PDF files from the internet and load them into R. For example, below I load 3 of Shakespeare's plays into R:
library(pdftools)
library(tidytext)
library(textrank)
library(tm)
library(dplyr)  # needed for %>%, tibble(), mutate(), select(), anti_join()
#1st document
url <- "https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words_1 <- article_words %>%
anti_join(stop_words, by = "word")
#2nd document
url <- "https://shakespeare.folger.edu/downloads/pdf/macbeth_PDF_FolgerShakespeare.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words_2 <- article_words %>%
anti_join(stop_words, by = "word")
#3rd document
url <- "https://shakespeare.folger.edu/downloads/pdf/othello_PDF_FolgerShakespeare.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words_3 <- article_words %>%
anti_join(stop_words, by = "word")
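As an aside, the three blocks above repeat the same pipeline; a sketch of a helper that wraps those steps (the name `read_play` is hypothetical, not from the original code):

```r
library(pdftools)
library(tidytext)
library(dplyr)

# Hypothetical helper wrapping the repeated steps above: read a PDF,
# tokenize pages into sentences, then into words, and drop stop words
read_play <- function(url) {
  tibble(text = pdf_text(url)) %>%
    unnest_tokens(sentence, text, token = "sentences") %>%
    mutate(sentence_id = row_number()) %>%
    select(sentence_id, sentence) %>%
    unnest_tokens(word, sentence) %>%
    anti_join(stop_words, by = "word")
}

article_words_1 <- read_play("https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf")
```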
Each of these objects (e.g. article_words_1) is now a tibble. From here, I want to convert them into a term document matrix so that I can do text mining and NLP on them:
# convert to a term document matrix
myCorpus <- Corpus(VectorSource(article_words_1, article_words_2, article_words_3))
tdm <- TermDocumentMatrix(myCorpus)
inspect(tdm)
But this produces an error:
Error in VectorSource(article_words_1, article_words_2, article_words_3) :
unused arguments (article_words_2, article_words_3)
Can someone show me what I am doing wrong?

Thanks
As the error message says, VectorSource() accepts only one argument. You can rbind() the datasets together and pass the result to the VectorSource() function.
library(tm)
tdm <- TermDocumentMatrix(
  Corpus(VectorSource(rbind(article_words_1, article_words_2, article_words_3)))
)
inspect(tdm)
#<<TermDocumentMatrix (terms: 14952, documents: 2)>>
#Non-/sparse entries: 14952/14952
#Sparsity : 50%
#Maximal term length: 25
#Weighting : term frequency (tf)
#Sample :
# Docs
#Terms 1 2
# "act", 0 397
# "cassio", 0 258
# "ftln", 0 10303
# "hamlet", 0 617
# "iago", 0 371
# "lord", 0 355
# "macbeth", 0 386
# "othello", 0 462
# "sc", 0 337
# "thou", 0 346
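An alternative sketch, assuming the dplyr and tidytext packages: since each tibble is already one word per row, you could tag each with a document id, count words per document, and cast the counts directly into a tm TermDocumentMatrix with cast_tdm(), which keeps the three plays as three separate documents. The toy tibbles below stand in for the real article_words_* objects:

```r
library(dplyr)
library(tidytext)
library(tm)

# Toy stand-ins for article_words_1/2/3 (one word per row, as in the question)
article_words_1 <- tibble(word = c("hamlet", "ghost", "hamlet"))
article_words_2 <- tibble(word = c("macbeth", "witch"))
article_words_3 <- tibble(word = c("othello", "iago", "iago"))

# Tag each tibble with a document id, stack them, count words per document,
# and cast the counts into a TermDocumentMatrix (one column per play)
tdm <- bind_rows(
  article_words_1 %>% mutate(doc = "hamlet"),
  article_words_2 %>% mutate(doc = "macbeth"),
  article_words_3 %>% mutate(doc = "othello")
) %>%
  count(doc, word) %>%
  cast_tdm(term = word, document = doc, value = n)

inspect(tdm)  # 3 documents, one column per play
```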