How to convert PDF with images which I don't care about to text?

Question

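I'm trying to convert PDFs to text files. The problem is that these PDFs contain images which I don't care about (this is the type of file I want to extract: https://www.sia.aviation-civile.gouv.fr/pub/media/store/documents/file/l/f/lf_sup_2020_213_fr.pdf). Note that if I copy/paste with the mouse, it works quite well (except for the line breaks), so I guess it is possible. Most of the answers I found online work well on dummy PDFs that contain only text, but give particularly bad results on the maps. For example, something like this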

from tika import parser # pip install tika
raw = parser.from_file('test2.pdf')
print(raw['content'])

retrieves the text well, but I also get a lot of garbage like this:

ERY

CTR

3

CH

A

which appears because of the map.

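Something like this, which works by converting the PDF to images and then reading the images, faces the same problem (I found it in a very similar thread on Stack Overflow, but that thread has no answer):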

import sys

import pytesseract as pt
from pdf2image import convert_from_path  # pip install pdf2image
from PIL import Image


def convert(name):
    # Render each PDF page to an image, then run OCR on it
    pages = convert_from_path(name, dpi=200)
    for idx, page in enumerate(pages):
        page.save('page' + str(idx) + '.jpg', 'JPEG')
        quote = Image.open('page' + str(idx) + '.jpg')
        text = pt.image_to_string(quote, lang="fra")
        # Write the recognized text of each page to its own file
        with open('page' + str(idx) + '.text', "w") as file_ex:
            file_ex.write(text)


if __name__ == '__main__':
    convert(sys.argv[1])

Finally, I tried removing the images first and then using one of the solutions above, but that did not work well either:

from tika import parser # pip install tika
from PyPDF2 import PdfFileWriter, PdfFileReader

# Remove the images
inputStream = open("lf_sup_2020_213_fr.pdf", "rb")
outputStream = open("test3.pdf", "wb")

src = PdfFileReader(inputStream)
output = PdfFileWriter()

for i in range(src.getNumPages()):
    output.addPage(src.getPage(i))
output.removeImages()

output.write(outputStream)
outputStream.close()

# Read from the PDF without images (the file written above)
raw = parser.from_file('test3.pdf')
print(raw['content'])

Do you know how to solve this? The solution can be in any language. Thanks.

Answer 1

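One approach you can try is to use a toolkit that can parse the individual text characters in the PDF, and then use the object properties to remove the unwanted map labels while keeping the text you need.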

For example, the ParsePages method from the LEADTOOLS PDF toolkit (which I am familiar with because I work for the vendor of this toolkit) can be used to obtain the text from a PDF:

using (PDFDocument document = new PDFDocument(pdfFileName))
{
   PDFParsePagesOptions options = PDFParsePagesOptions.All;
   document.ParsePages(options, 1, -1);

   using (StreamWriter writer = File.CreateText(txtFileName))
   {
      IList<PDFObject> objects = document.Pages[0].Objects;
      writer.WriteLine("Objects: {0}", objects.Count);
      foreach (PDFObject obj in objects)
      {
         if (obj.TextProperties.IsEndOfLine)
            writer.WriteLine(obj.Code);
         else
            writer.Write(obj.Code);
      }
      writer.WriteLine("---------------------");
   }
}

This gets all the text from the first page of the PDF, including the unwanted results you mentioned. Here is an excerpt:

Objects: 3918
5
91L
F5
4
1 LF
N
OY
L2
1AM
TService
8
26
1de l’Information 
0
B09SUP AIP 213/20 
7 
Aéronautique
 
Date de publication : 05 NOV 
e-mail  : sia.qualite@aviation-civile.gouv.fr 
Internet  : www.sia.aviation-civile.gouv.fr 
141
 
17˚
82
N20 
9Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry 
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020

A few more lines of code can be used to examine the properties of each parsed character:

writer.WriteLine("  ObjectType: {0}", obj.ObjectType.ToString());
writer.WriteLine("  Bounds: {0}, {1}, {2}, {3}", obj.Bounds.Left, obj.Bounds.Top, obj.Bounds.Right, obj.Bounds.Bottom);
writer.WriteLine("  TextProperties.FontHeight: {0}", obj.TextProperties.FontHeight.ToString());
writer.WriteLine("  TextProperties.FontIndex: {0}", obj.TextProperties.FontIndex.ToString());
writer.WriteLine("  Code: {0}", obj.Code);
writer.WriteLine("------");

This prints the properties of each character:

  Objects: 3918
  ObjectType: Text
  Bounds: -60.952693939209, 1017.25231933594, -51.8431816101074, 1023.71826171875
  TextProperties.FontHeight: 7.10454273223877
  TextProperties.FontIndex: 48
  Code: 5
------

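These properties can then be used to filter out the unwanted text. For example, I noticed that most of the unwanted text has a font height of about 7 PDF units, so the first code snippet can be changed to skip any text whose font height is not greater than 7.25 PDF units: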

foreach (PDFObject obj in objects)
{
   if (obj.TextProperties.FontHeight > 7.25)
   {
      if (obj.TextProperties.IsEndOfLine)
         writer.WriteLine(obj.Code);
      else
         writer.Write(obj.Code);
   }
}

The extracted output is much better. Here is an excerpt:

Objects: 3918
Service
de l’Information 
SUP AIP 213/20 
 
Aéronautique
Date de publication : 05 NOV 
e-mail  : sia.qualite@aviation-civile.gouv.fr 
Internet  : www.sia.aviation-civile.gouv.fr 
 
Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry 
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020 
 
Lieu : FIR : Marseille LFMM - AD : Chambéry Aix-Les-Bains LFLB, Chambéry Challes les Eaux LFLE 
 
ZRT LE SIRE, MOTTE CASTRALE, ALLEVARD 
*
C
D
E
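For readers working in Python, as in the question, the same font-height filter can be sketched without any PDF library. This is only an illustration: the `Glyph` record and the `filter_text` function are hypothetical names, and the code assumes you already have per-character records (character, font height, end-of-line flag) from whatever parser you use. The 7.25 threshold is the one from the C# snippet above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Glyph:
    """Hypothetical per-character record mirroring the properties used above."""
    code: str            # the character itself
    font_height: float   # font height in PDF units
    end_of_line: bool    # True if this character ends a line


def filter_text(glyphs: Iterable[Glyph], min_height: float = 7.25) -> str:
    """Keep only characters taller than min_height and rebuild the lines."""
    lines: List[str] = []
    current: List[str] = []
    for g in glyphs:
        if g.font_height > min_height:
            current.append(g.code)
            # As in the C# loop, line breaks are taken only from kept characters
            if g.end_of_line:
                lines.append("".join(current))
                current = []
    if current:
        lines.append("".join(current))
    return "\n".join(lines)


# Made-up sample: small map labels (7.1) mixed with body text (9.0 and 10.0)
sample = [
    Glyph("5", 7.1, True),
    Glyph("S", 10.0, False),
    Glyph("U", 10.0, False),
    Glyph("P", 10.0, True),
    Glyph("9", 7.1, False),
    Glyph("O", 9.0, False),
    Glyph("k", 9.0, False),
]
print(filter_text(sample))
```

The sample data is invented for the example; with real records the threshold would need to be tuned per document.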

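In the end, with this approach you will have to experiment to come up with a good criterion that filters out the unwanted text without removing the text you need to keep.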
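One possible way to look for such a criterion is to count how many characters fall into each font-height bucket and see whether the map labels form a separate cluster. The sketch below is an assumption-laden illustration, not part of any toolkit: `height_histogram` is a hypothetical helper, the input is assumed to be (character, font height) pairs from your parser, and the sample values are made up.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def height_histogram(chars: Iterable[Tuple[str, float]],
                     step: float = 0.5) -> Dict[float, int]:
    """Count characters per font-height bucket of width `step`."""
    counts: Counter = Counter()
    for _text, height in chars:
        # Round each height to the nearest multiple of `step`
        bucket = round(height / step) * step
        counts[bucket] += 1
    # Return the buckets sorted by height for easier reading
    return dict(sorted(counts.items()))


# Made-up sample: two small map labels and four body-text characters
sample_chars = [("5", 7.1), ("L", 7.1), ("S", 10.0),
                ("O", 9.0), ("b", 9.0), ("j", 9.0)]
for bucket, count in height_histogram(sample_chars).items():
    print(bucket, count)
```

If the histogram shows no clean gap between labels and body text, font height alone is probably not enough, and other properties such as the bounds or the font index may be worth combining with it.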
