Convert many .pdf files to .txt files using the new Tesseract OCR engine in R

Question

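My supervisor has asked me to convert .pdf files to .txt files so that they can be processed by a keyword extraction algorithm. The .pdf files are scanned court documents. She essentially wants a folder named court_document containing one subdirectory per case, each named after a 13-character case ID. I was given about 500 .pdf files named "caseid_docketnumber_date_documentdescription.pdf", for example "1-20-cr-30164_d2_5_23_2020_complaint.pdf". She also wants each .txt file saved as "docketnumber_date_documentdescription.txt", for example "d2_5_23_2020_complaint.txt". The .pdf files are stored in my working directory, court_document. The desired result is a root directory named court_document containing 500 subdirectories, each of which contains .txt files. I approached the problem as follows: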

# Packages  ---------------------------------------------------------------
library(tesseract)
library(pdftools)
library(magrittr)
library(purrr)
library(stringr)  # provides str_sub(), used below
library(bench)

# Function to convert .pdf to .txt ----------------------------------------
pdf_convert_txt <- function(pdf) {

  # Case id
  # The pdf file names are such that the first 13 characters are the case id's
  case_id <- str_sub(
    string = pdf,
    start = 1L,
    end = 13L
  )
  # File path for writing .txt file to subdirectory
  txt_file_path <- paste0(
    # Subdirectory
    paste0(case_id, "/"),
    # Ensure .txt file name does not include case id (first 14 char) and .pdf extension (last 4 char)
    str_sub(
      string = pdf,
      start = 15L,
      end = -5L
    ),
    # File extension
    ".txt"
  )

  # Create subdirectory using case id as its name
  if (!dir.exists(paths = case_id)) dir.create(path = case_id)

  # Convert pdf to png
  # This function creates one .png file per pdf page in current working dir
  # It also returns a character vector of .png file names
  pdf_convert(
    pdf = pdf,
    format = "png",
    dpi = 200
  ) %>%
    # Pass the character vector of .png file names to tesseract::ocr()
    # This function returns plain text by default
    ocr(image = .) %>%
    # Concatenate and save plain text .txt file to subdirectory created above
    cat(file = txt_file_path)

  # Remove all png files in current working directory
  file.remove(
    list.files(pattern = "\\.png$")
  )
}

# Apply pdf_convert_txt() to all .pdf files in current working dir -------------------
map(
  # All .pdf files in current working directory court_document
  .x = list.files(pattern = "\\.pdf$"),
  .f = pdf_convert_txt
)
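As a quick check of the naming logic in pdf_convert_txt(), the sketch below applies the same character positions to the example file name from the question. It uses base R substr() instead of stringr::str_sub(), and the helper name split_case_file() is hypothetical, introduced only for illustration.

```r
# Hypothetical helper: split a file name of the form
# "caseid_docketnumber_date_documentdescription.pdf" into its parts.
split_case_file <- function(pdf) {
  # The first 13 characters are the case id
  case_id <- substr(pdf, 1, 13)
  # Skip the case id and the underscore (14 chars), drop ".pdf" (last 4 chars)
  txt_name <- paste0(substr(pdf, 15, nchar(pdf) - 4), ".txt")
  list(case_id = case_id, txt_path = file.path(case_id, txt_name))
}

parts <- split_case_file("1-20-cr-30164_d2_5_23_2020_complaint.pdf")
parts$case_id   # "1-20-cr-30164"
parts$txt_path  # "1-20-cr-30164/d2_5_23_2020_complaint.txt"
```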

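This solution works, but profiling shows that ocr(image = .) is what really slows the code down. A typical court document has at least 50 pages, which means 50 png files to extract text from. On my 2020 Intel MacBook Pro, this line alone takes about 72,000 milliseconds to run. With this many .pdf files to process, I am wondering whether there is any way to get past this bottleneck, or whether I need to switch to a different tool. Any comments and suggestions would be greatly appreciated.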
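For anyone who wants to measure the OCR step on their own files, here is a minimal sketch that times only the ocr() call for a single pdf. The helper name time_ocr_step() is hypothetical, and it uses base R system.time() rather than the bench package loaded above. The dpi argument is exposed so the same pdf can be timed at different resolutions.

```r
# Hypothetical helper: time only the OCR step for one pdf.
time_ocr_step <- function(pdf, dpi = 200) {
  # Render one .png per page; pdf_convert() returns the file names it wrote
  png_files <- pdftools::pdf_convert(
    pdf = pdf, format = "png", dpi = dpi, verbose = FALSE
  )
  # Delete only the images created by this call, even if ocr() fails
  on.exit(file.remove(png_files), add = TRUE)

  # Elapsed wall-clock time of the OCR call, in seconds
  system.time(tesseract::ocr(image = png_files))[["elapsed"]]
}
```

Calling it on one representative document, for example time_ocr_step("1-20-cr-30164_d2_5_23_2020_complaint.pdf"), gives a baseline to compare any change against.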

Answer 1

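Following phiver's suggestion and some experimentation of my own, I was able to reduce the running time of the following code block by about 40% for my typical 50-page pdf, even before using multisession: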

  pdf_convert(
    pdf = pdf,
    format = "png",
    dpi = 80
  ) %>%
    ocr(image = .) %>%
    cat(file = txt_file_path)

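I did this by lowering the resolution (the dpi argument) when converting from .pdf to .png. Fortunately, the kind of .pdf files I am working with do not need a high resolution for the OCR engine to extract characters from the images. Finally, to use multisession (future::plan() + furrr::future_map()), I moved the following chunk outside the function: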

  file.remove(
    list.files(pattern = "\\.png$")
  )

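Because I am running parallel processes, I had to take the chunk above out of the function; otherwise a single process would delete all .png files in the working directory, including those that other processes still need to finish.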
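Put together, the multisession setup described above might look roughly like the sketch below. This is not the exact code I ran: the names pdf_convert_txt_no_cleanup() and convert_all_pdfs() and the default worker count are assumptions for illustration, and future_walk() is used in place of future_map() because only the side effect (the written .txt files) matters.

```r
# Same steps as pdf_convert_txt() in the question, but without the
# file.remove() call, so one worker cannot delete another worker's images.
pdf_convert_txt_no_cleanup <- function(pdf, dpi = 80) {
  case_id  <- substr(pdf, 1, 13)
  txt_name <- paste0(substr(pdf, 15, nchar(pdf) - 4), ".txt")
  # No warning if the subdirectory already exists
  dir.create(case_id, showWarnings = FALSE)

  png_files <- pdftools::pdf_convert(pdf = pdf, format = "png", dpi = dpi)
  text <- tesseract::ocr(image = png_files)
  cat(text, file = file.path(case_id, txt_name))
  invisible(png_files)
}

# Hypothetical driver; the worker count is an assumption, tune it to the machine.
convert_all_pdfs <- function(workers = 2) {
  # plan() returns the previous plan, which is restored on exit
  old_plan <- future::plan(future::multisession, workers = workers)
  on.exit(future::plan(old_plan), add = TRUE)

  pdfs <- list.files(pattern = "\\.pdf$")
  furrr::future_walk(pdfs, pdf_convert_txt_no_cleanup)

  # Clean up only after every worker has finished
  invisible(file.remove(list.files(pattern = "\\.png$")))
}
```

Run convert_all_pdfs() with court_document as the working directory. An alternative I have not benchmarked would be to keep the cleanup inside the function but delete only the file names returned by pdf_convert(), since those belong to a single pdf.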


ocr

tesseract

r

file-conversion