Memory problems when using lapply for corpus creation

Question

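My ultimate goal is to convert thousands of PDFs into a corpus / document-term matrix for topic modeling. I am using the pdftools package to import the PDFs and the tm package to prepare the data for text mining. I managed to import and convert a single PDF like this: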

txt <- pdf_text("pdfexample.pdf")

#create corpus
txt_corpus <- Corpus(VectorSource(txt))

# Some basic text prep, with tm_map(), like:
txt_corpus <- tm_map(txt_corpus, tolower)

# create document term matrix
dtm <- DocumentTermMatrix(txt_corpus)

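However, I am completely stuck on automating this process, and I have limited experience with loops and the apply functions. My approach runs into memory problems when converting the raw pdf_text() output into a corpus, even though I tested the code with only 5 PDF files (1.5 MB in total). R tried to allocate a vector of more than half a GB, which cannot be right. This is what I tried: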

# Create a list of all pdf paths
file_list <- list.files(path = "mydirectory",
                 full.names = TRUE,
                 pattern = "name*", # to import only specific pdfs
                 ignore.case = FALSE)

# Run a function that reads the pdf of each of those files:
all_files <- lapply(file_list, FUN = function(files) {
             pdf_text(files)
             })

all_files_corpus = lapply(all_files,
                          FUN = Corpus(DirSource())) # That's where I run into memory issues

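Am I doing something fundamentally wrong? I am not sure whether this is purely a memory issue or whether there is an easier way to approach the problem. From what I have gathered, lapply should at least be more memory-efficient than a loop, but maybe there is more to it. I have been trying to solve this on my own for days, without success.

Thanks for any advice or hints on how to proceed!

Edit: I tried running the lapply with just one PDF and R crashed again, even though I have no capacity problems at all when using the first code snippet.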

Answer 1

You can write a function that contains the series of steps to run on each PDF.

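library(pdftools)
library(tm)

pdf_to_dtm <- function(file) {
  # read the text of the pdf, one string per page
  txt <- pdf_text(file)
  # create corpus
  txt_corpus <- Corpus(VectorSource(txt))
  # Some basic text prep, with tm_map(), like:
  # (content_transformer() keeps the corpus structure intact)
  txt_corpus <- tm_map(txt_corpus, content_transformer(tolower))
  # create document term matrix
  dtm <- DocumentTermMatrix(txt_corpus)
  dtm
}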
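Because pdf_text() returns one string per page, each page of a PDF ends up as its own document (row) in the resulting matrix. The following is a minimal sketch for checking the corpus and matrix steps without any PDF at hand. The `pages` vector is a made-up stand-in for pdf_text() output, not data from the question.

```r
library(tm)

# Hypothetical stand-in for pdf_text() output: one string per page
pages <- c("Topic models describe documents as mixtures of topics.",
           "Each topic is a distribution over words.")

# Same steps as inside pdf_to_dtm(), minus the PDF import
corpus <- Corpus(VectorSource(pages))
corpus <- tm_map(corpus, content_transformer(tolower))
dtm <- DocumentTermMatrix(corpus)

# Rows are documents (here: pages), columns are distinct terms
dim(dtm)
```

If this runs without memory trouble, the text preparation itself is unlikely to be the bottleneck.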

Use lapply to apply the function to each file:

file_list <- list.files(path = "mydirectory",
                 full.names = TRUE,
                 pattern = "name*", # to import only specific pdfs
                 ignore.case = FALSE)

all_files_corpus <- lapply(file_list, pdf_to_dtm)
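Two remarks on this, offered as a sketch rather than a definitive fix. First, lapply expects a function for FUN. In the question's second call, `Corpus(DirSource())` is evaluated before lapply can use it, and DirSource() defaults to the current working directory, so R likely tried to build a corpus from every file there. That would plausibly explain the memory use even with a single PDF. Second, the call above returns a list with one matrix per file. For topic modeling across all files, a single matrix with one row per PDF is often more convenient. The example below assumes the text has already been read with something like `lapply(file_list, pdf_text)`; the `all_files` list and its file names are made up for illustration.

```r
library(tm)

# Hypothetical stand-in for lapply(file_list, pdftools::pdf_text):
# one character vector of pages per PDF
all_files <- list(
  "report_a.pdf" = c("First page about memory usage.",
                     "Second page about corpora."),
  "report_b.pdf" = c("Single page about topic models.")
)

# Collapse the pages of each PDF into one string, so each PDF is one document
docs <- vapply(all_files, paste, character(1), collapse = " ")

# Build one corpus and one document-term matrix for all PDFs
corpus <- Corpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
dtm <- DocumentTermMatrix(corpus)

# One row per PDF
dim(dtm)
```

Whether to keep pages as separate documents or collapse them per file depends on what should count as a document in the topic model.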

