R: Text Mining, create list of words per document

Question

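I am reading the text of several PDFs in a directory. I then split that text into single words (tokens) with tidytext::unnest_tokens(). Can someone tell me how to add an extra column to the test tibble that holds the name of the file each word comes from?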

library(pdftools)
library(tidyverse)
library(tidytext)

files <- list.files(pattern = "pdf$")
content <- lapply(files, pdf_text)
list <- unlist(content, recursive = TRUE, use.names = TRUE)
df = data.frame(text = list)

test <- df %>% tidytext::unnest_tokens(word, text)

Answer 1

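You could do it like this:

files <- list.files(pattern = "pdf$")
# stack() turns the named list into a data frame with the page text in
# `values` and the originating file name in `ind`
content <- stack(sapply(files, pdf_text, simplify = FALSE))
content %>% 
   tidytext::unnest_tokens(word, values)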
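
To make the shape of the stack() result easier to see, here is a minimal base R sketch that uses a small in-memory list in place of real PDFs. The file names and page texts are made up for illustration; the list is assumed to mimic what sapply(files, pdf_text, simplify = FALSE) returns, that is, one character vector per file with one element per page.

```r
# Toy stand-in for sapply(files, pdf_text, simplify = FALSE):
# a named list with one character vector per file, one element per page.
# File names and texts are invented for this example.
pages <- list(
  "a.pdf" = c("first page of a", "second page of a"),
  "b.pdf" = "only page of b"
)

# stack() returns a data frame with two columns:
#   values - the page text
#   ind    - a factor holding the list name, repeated once per page
stacked <- stack(pages)
print(stacked)
```

Because `ind` travels with every row, it is still there after unnest_tokens() splits `values` into words, so each word keeps its file name.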

Answer 2

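You can try the following. Instead of calling unlist on all the files, pass the whole vector of files to map_df from purrr. Inside the mapped function you can then add a filename column alongside the word column.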

library(pdftools)
library(tidyverse)
library(tidytext)

files <- list.files(pattern = "pdf$")

map_df(files, ~ data.frame(txt = pdf_text(.x)) %>%
         mutate(filename = .x) %>%
         unnest_tokens(word, txt))
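
A possible variation on the same idea, sketched under a few assumptions: the input vector is named with purrr::set_names() and the `.id` argument of map_dfr() fills in the file name column, so no explicit mutate() is needed. The helper fake_pdf_text() is hypothetical and only stands in for pdf_text() so that the example runs without any PDF files; the file names are invented as well.

```r
library(purrr)
library(dplyr)
library(tidytext)

# Hypothetical stand-in for pdf_text(): returns one string per "page".
fake_pdf_text <- function(path) {
  c("hello world", "more text here")
}

files <- c("a.pdf", "b.pdf")  # made-up file names

words <- files %>%
  set_names() %>%                                    # name each element by its own value
  map_dfr(~ tibble(txt = fake_pdf_text(.x)),         # one row per page
          .id = "filename") %>%                      # element names become the filename column
  unnest_tokens(word, txt)                           # one row per word

print(words)
```

With real data, fake_pdf_text would simply be replaced by pdf_text.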

Answer 3

The plyr package has a nice function for binding a list into a data frame and using the list names as a new column:

library(pdftools)
library(plyr)
library(tidyverse)
library(tidytext)

files <- list.files(pattern = "pdf$")
content <- lapply(files, pdf_text) 
# set the list names according to the files
names(content) <- files 
# note: unlist() appends a page index to the name of any multi-page file
list <- unlist(content, recursive = TRUE, use.names = TRUE)

# use the corresponding function from the plyr package and check the result
plyr::ldply(list)
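
One detail worth checking with this approach is what unlist() does to the names. The base R sketch below uses a made-up named list in place of the pdf_text() output and shows that multi-page entries get an index appended to their name. It then builds the file name column explicitly with rep() and lengths(), which is one possible way to keep the names clean and not necessarily what the answer intended.

```r
# Toy stand-in for the named `content` list; names and texts are invented.
content <- list(
  "a.pdf" = c("page one", "page two"),
  "b.pdf" = "single page"
)

# unlist() keeps the name as-is for single-page files but appends
# an index for files with more than one page.
flat <- unlist(content, use.names = TRUE)
print(names(flat))

# Alternative: repeat each file name once per page, so the column
# holds the plain file name for every row.
df <- data.frame(
  filename = rep(names(content), lengths(content)),
  text     = unlist(content, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(df)
```

The resulting df can be passed to unnest_tokens(word, text) in the same way as in the question.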
