Tesseract "Error in pixCreateNoInit: pix_malloc fail for data"

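I am trying to use Tesseract in a function loosely based on this one. Since xPDF can convert PDFs to PNGs directly, I skipped the ImageMagick conversion step. I also dropped the faulty logic in the function(i) step: pdftopng requires a root name for its output (here the files come out as "ocrbook-000001.png" and so on), so looking for a PNG named after the original PDF throws an error.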

My problem now is getting Tesseract to do anything at all with my PNG files. I get this error:

Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.

Here is my code:

lapply(myfiles, function(i){
  # Render pages 1-10 of the PDF at 600 DPI as ocrbook-000001.png, ocrbook-000002.png, ...
  shell(shQuote(paste0("pdftopng -f 1 -l 10 -r 600 ", i, " ocrbook")))
  mypngs <- list.files(path = dest, pattern = "png", full.names = TRUE)
  lapply(mypngs, function(z){
    # OCR each PNG, then delete it
    shell(shQuote(paste0("tesseract ", z, " out")))
    file.remove(z)
  })
})

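Apparently the problem was that the DPI was set too high for Tesseract to handle. Changing the pdftopng DPI parameter from 600 to 150 appears to have fixed the issue. Tesseract seems to have an upper limit on the resolution it can work with. Judging by the pix_malloc message, Leptonica could not allocate memory for the decoded image, so the limit is probably on image size rather than on DPI as such.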
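
To get a feel for how much the resolution matters, here is a rough back-of-the-envelope sketch in R. It assumes a US Letter page (8.5 x 11 in) and 4 bytes per pixel for the decoded image. Both numbers are my assumptions, not something measured from the PDFs above, and page_pixels and image_mb are just illustrative helper names.

```r
# Assumed page size in inches (US Letter); real pages may differ.
page_pixels <- function(dpi, width_in = 8.5, height_in = 11) {
  c(width = width_in * dpi, height = height_in * dpi)
}

# Approximate size of the decoded image in MB,
# assuming 4 bytes per pixel (an assumption, not a measured value).
image_mb <- function(dpi, width_in = 8.5, height_in = 11, bytes_per_pixel = 4) {
  prod(page_pixels(dpi, width_in, height_in)) * bytes_per_pixel / 1e6
}

image_mb(600)                  # about 134.6 MB per page
image_mb(150)                  # about 8.4 MB per page
image_mb(600) / image_mb(150)  # 16
```

Under these assumptions, dropping from 600 to 150 DPI cuts the memory needed per page by a factor of 16, which fits with the allocation error going away.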

I also changed my code from a static naming convention to a more dynamic one that mimics the original file names.

  # Forward slashes: inside an R string, "\u" would be parsed as a Unicode escape
  dest <- "C:/users/YOURNAME/desktop"

  # Render pages 1-10 of every PDF at 150 DPI; pdftoppm writes <root>-<page>.ppm
  files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE))
    lapply(files, function(i){
      shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 150 ", i,".pdf", " ",i)))
      })

  # Convert each PPM to TIFF with ImageMagick, then delete the PPM
  myppms <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.ppm$", full.names = TRUE))
    lapply(myppms, function(y){
      shell(shQuote(paste0("magick ", y,".ppm"," ",y,".tif")))
      file.remove(paste0(y,".ppm"))
      })

  # OCR each TIFF into <name>.txt, then delete the TIFF
  mytiffs <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.tif$", full.names = TRUE))
    lapply(mytiffs, function(z){
      shell(shQuote(paste0("tesseract ", z,".tif", " ",z)))
      file.remove(paste0(z,".tif"))
      })
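
As a possible variation on the script above, the command lines can be built as argument vectors and run with system2() instead of pasting one long string for shell(). This is only a sketch: pdftoppm_args, tesseract_args and run_tool are hypothetical helper names, and it assumes pdftoppm and tesseract are on the PATH. Quoting each path with shQuote() is meant to keep file names containing spaces intact.

```r
# Hypothetical helper: arguments for rendering pages of one PDF with pdftoppm.
# The output root is the PDF path without its extension.
pdftoppm_args <- function(pdf, dpi = 150, first = 1, last = 10) {
  root <- tools::file_path_sans_ext(pdf)
  c("-f", first, "-l", last, "-r", dpi, shQuote(pdf), shQuote(root))
}

# Hypothetical helper: arguments for OCR-ing one image with tesseract.
# The output base is the image path without its extension.
tesseract_args <- function(image) {
  c(shQuote(image), shQuote(tools::file_path_sans_ext(image)))
}

# Run a command and warn if it exits with a non-zero status.
run_tool <- function(command, args) {
  status <- system2(command, args)
  if (status != 0) warning(command, " exited with status ", status)
  invisible(status)
}

# Example usage (not run here; dest is the folder from the script above):
# for (pdf in list.files(dest, pattern = "\\.pdf$", full.names = TRUE)) {
#   run_tool("pdftoppm", pdftoppm_args(pdf))
# }
```

Keeping the argument building separate from the call makes it easy to print the arguments and check them before anything is actually executed.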