Batch OCR of 5800+ PDF written in German Fraktur

Question

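I want to batch OCR about 5800 PDFs (each containing 2 to 6 pages of the kind shown in my previous question) with open source command line tools on a Mac. The main purpose of this adventure is to retrieve names (surnames most importantly) as reliably as I can from the text of all these PDFs. Here is an example of one.

At this point, I don't know how to proceed. What would you do?

I thought of first converting all multi-page PDFs into single-page images such as png, jpg or tif, and then moving all images that belong to one PDF into a corresponding folder, using this command: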

time for i in *.pdf; do mkdir "${i%.pdf}"; convert -colorspace GRAY -resize 3000x -units PixelsPerInch "$i" "${i%.pdf}.jpg"; mv *.jpg "${i%.pdf}"; done

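As a second step, I face the problem that my OCR script needs to enter each folder, do its magic, and then leave it again before continuing with the next one. I don't know how to write this. The core of the script would be: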

tesseract --tessdata-dir /usr/local/share/tessdata/ --oem 3 --psm 11 -l deu_frak *.jpg test.txt

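Since the PDFs are old newspaper articles, published almost daily from 1810 to 1832, they are set in German Fraktur. This typeface seems to be particularly challenging for tesseract. My text output is usually garbled, e.g. in the article linked above I would only detect 791 to 801 umlauts on the first page. Depending on the options chosen, names may not be recognized at all.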
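One crude way to compare option sets is to count the umlauts that survive in each output file. The following is only a sketch: the helper name count_umlauts is made up for illustration, and it assumes the OCR output is UTF-8 text.

```shell
# Sketch: count umlaut characters in one text file, as a rough yardstick
# for comparing OCR runs. count_umlauts is a hypothetical helper name.
count_umlauts() {
    # -a treat the file as text, -o print each match on its own line,
    # -F treat the patterns as literal strings
    grep -aoF -e 'ä' -e 'ö' -e 'ü' -e 'Ä' -e 'Ö' -e 'Ü' -- "$1" | wc -l | tr -d ' '
}
```

Running it on the outputs of two tesseract runs of the same page (for example with different --psm values) shows which run kept more umlauts. It says nothing about whether the surrounding words are correct.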

Finally, I would use the silver searcher to look for names in all 5800 txt files:

time rg -i search_term_here

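Ultimately, how can I make sure I get the best possible OCR output, so that I capture most of the (sur)names in the text?

P.S.: By the way, when will tesseract 4 be available for the Mac, together with German Fraktur training data?

Edit:

These are the commands I used to achieve what I wanted, although the tesseract output still leaves room for improvement.

Convert each PDF to jpg and move the images into a corresponding folder to keep them in order: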

time parallel -j 8 'mkdir {.} && convert {} -colorspace GRAY -resize 3000x -units PixelsPerInch {.}/{.}.jpg' ::: *.pdf

Use Fred's ImageMagick script textcleaner (which I moved to /usr/local/bin/ for convenience) to improve the tesseract output a little:

time find . -name \*.jpg | parallel textcleaner -g -e stretch -f 25 -o 10 -u -s 1 -T -p 10 {} {}

Run the tesseract analysis in parallel:

time find . -name \*.jpg | parallel -j 8 "tesseract {} {.}.txt --tessdata-dir /usr/local/share/tessdata/ -l deu_frak"

Search for surnames with the silver searcher:

time rg -t txt -i term
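To repeat this search for a whole list of surnames, something like the following sketch could be used. It relies only on find and grep rather than the search tool above, and the helper name count_name_hits and the list file surnames.list are hypothetical. Keep the list file outside the set of .txt files so it does not match itself.

```shell
# Sketch: for each surname in a list file (one name per line), print the
# name and the number of .txt files below the current directory that
# mention it. count_name_hits is a hypothetical helper name.
count_name_hits() (
    while IFS= read -r name; do
        [ -n "$name" ] || continue      # skip blank lines
        # -l list matching files only, -i ignore case, -F literal match
        hits=$(find . -type f -name '*.txt' -exec grep -liF -- "$name" {} + | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$name" "$hits"
    done < "$1"
)
```

Called as count_name_hits surnames.list, it prints one tab-separated line per surname. Whether -i also folds the case of umlauts depends on the locale, so names containing them are best listed in the spelling expected in the text.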

Answer 1

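First, if you haven't got homebrew installed, I suggest you install it - it is an excellent package manager for the Mac.

Then I would suggest you install the Poppler package to get the pdfimages tool: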

brew install poppler

Then you can extract the images from a PDF like this:

pdfimages SomeFile.pdf root

You will get files named root-000.ppm and root-001.ppm, which work fine with tesseract. Alternatively, you can add -png if you want PNG images. I would avoid JPEG because of its lossy compression.
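As a minimal sketch of the PNG variant, the extraction for a single PDF could be wrapped like this. The helper name extract_pages is made up for illustration, and the folder layout (one folder per PDF, named after it) is an assumption taken from the question.

```shell
# Sketch: extract the embedded page images of one PDF as PNG files into a
# folder named after the PDF. extract_pages is a hypothetical helper name.
extract_pages() (
    pdf=$1
    base=$(basename "$pdf" .pdf)        # SomeFile.pdf -> SomeFile
    mkdir -p "$base" || exit 1
    # -png makes pdfimages write PNG files instead of the default PPM/PBM
    pdfimages -png "$pdf" "$base/$base"
)
```

Calling extract_pages SomeFile.pdf should leave files such as SomeFile/SomeFile-000.png in place, one per embedded image.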

If you can get that working, I would suggest you install GNU Parallel with:

brew install parallel

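so that we can do the OCR in parallel.

Please only try the following in a small directory containing 5-6 of your originals.

We can also extract the images in parallel using GNU Parallel, like this: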

parallel 'mkdir {.} && pdfimages {} {.}/{.}' ::: *pdf

As regards using Fred's textcleaner with GNU Parallel, and wanting to overwrite the JPEGs, I think you will want something like this:

find . -name \*.jpg | parallel textcleaner -g -e stretch -f 25 -o 10 -u -s 1 -T -p 10 {} {}
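For the OCR step itself, a minimal sequential sketch could look like the following. It assumes the images sit in per-PDF folders below the current directory and that the deu_frak language data named in the question is installed where tesseract can find it. The helper name ocr_tree is hypothetical.

```shell
# Sketch: OCR every image below the current directory, writing one text
# file next to each image. ocr_tree is a hypothetical helper name.
ocr_tree() (
    ext=${1:-ppm}                       # image extension to look for
    find . -type f -name "*.$ext" | while IFS= read -r img; do
        # tesseract appends .txt to the output base by itself;
        # stdin is redirected so the loop's file list is not consumed
        tesseract "$img" "${img%.*}" -l deu_frak < /dev/null
    done
)
```

Run it as ocr_tree ppm (or ocr_tree png) in the small test directory first. Once the results look right, the same tesseract call can be handed to GNU Parallel instead of the loop, for example find . -name '*.ppm' | parallel 'tesseract {} {.} -l deu_frak'.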


