How to preserve images and styling in a PDF when creating a searchable PDF?

Question

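I have a website where my clients can upload their documents (mostly PDFs). I want to make those PDFs searchable without changing how they look. I am trying to build a .NET endpoint that I can POST a file to in order to do this.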

I have tried combining iTextSharp with Tesseract, but neither gave me what I want. Here is the code I tried.

Getting the text from the PDF with Tesseract:

     using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
     using (var img = Pix.LoadFromFile(testImagePath))
     using (var page = engine.Process(img))
     {
        var text = page.GetText();
     }

Then generating a new PDF from the old one with iTextSharp:

     // open the reader
     PdfReader reader = new PdfReader(oldFile);
     Rectangle size = reader.GetPageSizeWithRotation(1);
     Document document = new Document(size);

     // open the writer
     FileStream fs = new FileStream(newFile, FileMode.Create, FileAccess.Write);
     PdfWriter writer = PdfWriter.GetInstance(document, fs);
     document.Open();

     // the pdf content
     PdfContentByte cb = writer.DirectContent;

     // select the font properties
     BaseFont bf = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
     cb.SetColorFill(BaseColor.DARK_GRAY);
     cb.SetFontAndSize(bf, 8);

     // write the text in the pdf content
     cb.BeginText();
     string text = "Some random blablablabla...";
     // put the alignment and coordinates here
     cb.ShowTextAligned(1, text, 520, 640, 0);
     cb.EndText();
     cb.BeginText();
     text = "Other random blabla...";
     // put the alignment and coordinates here
     cb.ShowTextAligned(2, text, 100, 200, 0);
     cb.EndText();

     // create the new page and add it to the pdf
     PdfImportedPage page = writer.GetImportedPage(reader, 1);
     cb.AddTemplate(page, 0, 0);

     // close the streams and voilà, the file should be changed :)
     document.Close();
     fs.Close();
     writer.Close();
     reader.Close();

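However, I am having trouble producing the output I need. Is there a simpler way to achieve what I am looking for? Below is an example of a PDF I am trying to make searchable. I don't want to lose the images or the fonts/styling of the PDF. I only want to make it searchable: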

https://www.fujitsu.com/global/Images/sv600_c_normal.pdf

Answer 1

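If you are open to using a commercial product for this, the LEADTOOLS SDK has an OCR toolkit with image-over-text functionality. This feature places the image of the original file as an overlay in the output PDF, which makes the text searchable while preserving the appearance of the original input file.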
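The image-over-text idea is not tied to one SDK: a PDF can carry a text layer drawn in the invisible text rendering mode, so the page looks unchanged while the words can still be found by search. The following is only a minimal sketch of that idea with iTextSharp 5.x, not how LEADTOOLS implements it. The class `InvisibleTextLayer`, the method `AddWord`, and its parameters are hypothetical names for illustration. It assumes the word and its position (in PDF points, origin at the bottom-left of page 1) have already been obtained from OCR.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class InvisibleTextLayer
{
    // Hypothetical helper: stamps a single OCR'd word onto page 1 as invisible text.
    // x and y are assumed to be in PDF points, measured from the bottom-left corner.
    public static void AddWord(string inputPdf, string outputPdf,
                               string word, float x, float y, float fontSize)
    {
        var reader = new PdfReader(inputPdf);
        using (var fs = new FileStream(outputPdf, FileMode.Create, FileAccess.Write))
        {
            // PdfStamper copies the existing pages as they are, so images,
            // fonts and layout of the original are left untouched.
            var stamper = new PdfStamper(reader, fs);
            PdfContentByte cb = stamper.GetOverContent(1);

            BaseFont bf = BaseFont.CreateFont(
                BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);

            cb.SaveState();
            cb.BeginText();
            cb.SetFontAndSize(bf, fontSize);
            // Invisible mode: glyphs are neither filled nor stroked,
            // but the text remains part of the page content.
            cb.SetTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_INVISIBLE);
            cb.ShowTextAligned(Element.ALIGN_LEFT, word, x, y, 0);
            cb.EndText();
            cb.RestoreState();

            stamper.Close();
        }
        reader.Close();
    }
}
```

A real pipeline would call something like this once per recognized word on every page, and would first have to render each page to an image for the OCR engine. The sketch leaves both of those steps out.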

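Using LEADTOOLS, I was able to convert your document into a searchable version that still looks like the original, with the following code: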

     string folderPath = "filepath";

     string inputFilename = Path.Combine(folderPath, "sv600_c_normal.pdf");
     string outputFilename = Path.Combine(folderPath, "sv600_c_normal-output.pdf");

     IOcrEngine engine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD);
     engine.Startup(null, null, null, null);

     PdfDocumentOptions pdfOptions = engine.DocumentWriterInstance.GetOptions(DocumentFormat.Pdf) as PdfDocumentOptions;
     pdfOptions.ImageOverText = true;
     engine.DocumentWriterInstance.SetOptions(DocumentFormat.Pdf, pdfOptions);

     engine.AutoRecognizeManager.Run(inputFilename, outputFilename, DocumentFormat.Pdf, null, null);

Here is the output for the sample file. It is searchable and closely resembles the original.

Disclaimer: I work for this company.
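For anyone building a text layer by hand from OCR output instead (for example with Tesseract and iTextSharp, as in the question), word positions have to be converted before they can be used in the PDF. OCR engines typically report bounding boxes in image pixels with the origin at the top-left, while PDF content uses points (1/72 inch) with the origin at the bottom-left. The sketch below shows that arithmetic only. `OcrToPdfCoordinates` and its members are hypothetical names, and it assumes an unrotated page whose image covers the whole page at a known DPI.

```csharp
using System;

public static class OcrToPdfCoordinates
{
    // Converts a length in image pixels to PDF points (1 point = 1/72 inch).
    public static float PixelsToPoints(float pixels, float dpi)
    {
        if (dpi <= 0) throw new ArgumentOutOfRangeException(nameof(dpi));
        return pixels * 72f / dpi;
    }

    // Maps the bottom-left corner of an OCR bounding box (pixels, top-left origin)
    // to a PDF position (points, bottom-left origin).
    public static (float X, float Y) ToPdfPoint(
        float leftPx, float bottomPx, float dpi, float pageHeightPoints)
    {
        float x = PixelsToPoints(leftPx, dpi);
        // Flip the vertical axis: pixel rows grow downward, PDF y grows upward.
        float y = pageHeightPoints - PixelsToPoints(bottomPx, dpi);
        return (x, y);
    }
}
```

As a worked case, a word whose box starts 300 px from the left and ends 600 px from the top of a 300 DPI scan, on a page 792 points tall, maps to x = 72 and y = 792 - 144 = 648.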

c#

pdf

ocr

tesseract

itext