Highlighted words are not displayed correctly in OCR PDF

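I highlighted the texts "F O R M - 2" and "Title of the Invention :". The first string is highlighted correctly, but for the second string only "itle of the Invention :" is highlighted. I used the code below to highlight the words.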

private void highlightPDFAnnotation(string outputFile, string highLightFile, int pageno, string[] splitText)
{
    try
    {
        PdfReader reader = new PdfReader(outputFile);

        using (FileStream fs = new FileStream(highLightFile, FileMode.Create, FileAccess.Write, FileShare.None))
        {
            using (PdfStamper stamper = new PdfStamper(reader, fs))
            {
                myLocationTextExtractionStrategy strategy = new myLocationTextExtractionStrategy();

                string currentText = PdfTextExtractor.GetTextFromPage(reader, pageno, strategy);
                for (int i = 0; i < splitText.Length; i++)
                {
                    List<iTextSharp.text.Rectangle> MatchesFound = strategy.GetTextLocations(splitText[i].Trim(), StringComparison.CurrentCultureIgnoreCase);
                    foreach (Rectangle rect in MatchesFound)
                    {

                        float[] quad = { rect.Left , rect.Bottom, rect.Right, rect.Bottom, rect.Left , rect.Top , rect.Right, rect.Top  };
                        //Create our highlight
                        PdfAnnotation highlight = PdfAnnotation.CreateMarkup(stamper.Writer, rect, null, PdfAnnotation.MARKUP_HIGHLIGHT, quad);
                        //Set the color
                        highlight.Color = BaseColor.YELLOW;

                        PdfAppearance appearance = PdfAppearance.CreateAppearance(stamper.Writer, rect.Width, rect.Height);
                        PdfGState state = new PdfGState();
                        state.BlendMode = new PdfName("Multiply");
                        appearance.SetGState(state);
                        appearance.Rectangle(0, 0, rect.Width, rect.Height);
                        appearance.SetColorFill(BaseColor.YELLOW);
                        appearance.Fill();

                        highlight.SetAppearance(PdfAnnotation.APPEARANCE_NORMAL, appearance);

                        //Add the annotation
                        stamper.AddAnnotation(highlight, pageno);
                    }
                }
            }
        }
        reader.Close();
        File.Copy(highLightFile, outputFile,true);
        File.Delete(highLightFile);
    }
    catch (Exception ex)
    {
        throw;
    }

}

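As you presumed,

It's not displaying correctly because of OCR PDF

or more precisely, because the letters that the OCR process drew underneath the image are not positioned correctly relative to the image, while your code inspects exactly those letters to position the markup.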

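In more detail

Compare the strip around "Title of the Invention" in the scanned image

and the corresponding strip in the underlying OCR information

and you immediately see that "Title of the Invention" sits somewhat further to the right in the latter.

@BrunoLowagie made the difference even more visible: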

I've brought the text to the foreground and made it red so that you see how much difference there is between the image and the OCR:

When you retrieve positions via text extraction, the positions you get are shifted to the right in the same way.
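To see the coordinates that text extraction actually reports, one could dump every text chunk on the page together with its baseline position. The following is only a minimal sketch, assuming iTextSharp 5.x; the class name ChunkPositionDumper and the command-line arguments are made up for illustration and are not part of the question's code.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper: prints each text chunk with the start and end
// of its baseline in default user space coordinates.
class ChunkPositionDumper : IRenderListener
{
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment baseline = renderInfo.GetBaseline();
        Vector start = baseline.GetStartPoint();
        Vector end = baseline.GetEndPoint();
        Console.WriteLine("\"{0}\": x = {1} to {2}, y = {3}",
            renderInfo.GetText(), start[Vector.I1], end[Vector.I1], start[Vector.I2]);
    }
}

static class Program
{
    static void Main(string[] args)
    {
        // Assumed arguments: args[0] = path to the OCR PDF, args[1] = 1-based page number
        PdfReader reader = new PdfReader(args[0]);
        try
        {
            new PdfReaderContentParser(reader)
                .ProcessContent(int.Parse(args[1]), new ChunkPositionDumper());
        }
        finally
        {
            reader.Close();
        }
    }
}
```

If the x values printed for the "Title of the Invention" chunks lie to the right of where the words appear in the scanned image, the highlight rectangles built from them will be shifted by the same amount.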

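A quicker check

If you simply search for "Title of the Invention" in Adobe Reader, you can also recognize the issue:

The whole page

Looking at the OCR information for the whole page, it is obvious at a glance that the quality is not very good. You will therefore run into many such problems when processing this document.

The whole scanned page

The OCR information of the whole page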