Highlighted words are not displayed correctly in OCR PDF
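I highlighted the texts "F O R M - 2" and "Title of the Invention :". The first string is highlighted correctly, but for the second string only "itle of the Invention :" is highlighted. I used the code below to highlight the words.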
private void highlightPDFAnnotation(string outputFile, string highLightFile, int pageno, string[] splitText)
{
    try
    {
        PdfReader reader = new PdfReader(outputFile);
        using (FileStream fs = new FileStream(highLightFile, FileMode.Create, FileAccess.Write, FileShare.None))
        {
            using (PdfStamper stamper = new PdfStamper(reader, fs))
            {
                myLocationTextExtractionStrategy strategy = new myLocationTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(reader, pageno, strategy);
                for (int i = 0; i < splitText.Length; i++)
                {
                    List<iTextSharp.text.Rectangle> MatchesFound = strategy.GetTextLocations(splitText[i].Trim(), StringComparison.CurrentCultureIgnoreCase);
                    foreach (Rectangle rect in MatchesFound)
                    {
                        float[] quad = { rect.Left, rect.Bottom, rect.Right, rect.Bottom, rect.Left, rect.Top, rect.Right, rect.Top };
                        //Create our highlight
                        PdfAnnotation highlight = PdfAnnotation.CreateMarkup(stamper.Writer, rect, null, PdfAnnotation.MARKUP_HIGHLIGHT, quad);
                        //Set the color
                        highlight.Color = BaseColor.YELLOW;
                        PdfAppearance appearance = PdfAppearance.CreateAppearance(stamper.Writer, rect.Width, rect.Height);
                        PdfGState state = new PdfGState();
                        state.BlendMode = new PdfName("Multiply");
                        appearance.SetGState(state);
                        appearance.Rectangle(0, 0, rect.Width, rect.Height);
                        appearance.SetColorFill(BaseColor.YELLOW);
                        appearance.Fill();
                        highlight.SetAppearance(PdfAnnotation.APPEARANCE_NORMAL, appearance);
                        //Add the annotation
                        stamper.AddAnnotation(highlight, pageno);
                    }
                }
            }
        }
        reader.Close();
        File.Copy(highLightFile, outputFile, true);
        File.Delete(highLightFile);
    }
    catch (Exception ex)
    {
        throw;
    }
}
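As you assumed,
It's not displaying correctly because of OCR PDF
or more precisely, because the letters drawn beneath the image during OCR are not positioned correctly relative to the image, while your code inspects exactly these letters to position the markup.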
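In more detail
Comparing the strip around "Title of the Invention" in the scanned image
with the corresponding strip in the underlying OCR information,
one immediately recognizes that "Title of the Invention" appears somewhat further to the right in the latter.
@BrunoLowagie made the difference even more obvious: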
I've brought the text to the foreground and made it red so that you see how much difference there is between the image and the OCR:
When you retrieve positions via text extraction, the positions you get are likewise somewhat too far to the right.
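To see this offset in your own file without relying on a viewer, one option is to outline the rectangles that text extraction returns instead of adding annotations. The following is only a minimal sketch, assuming iTextSharp 5.x and assuming the rectangles come from the `GetTextLocations` call in the question; the class `OcrOffsetDebug` and the method `OutlineRectangles` are hypothetical names, not part of iTextSharp.

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class OcrOffsetDebug
{
    // Hypothetical helper: strokes a thin red outline around each rectangle on the
    // given page so the extracted text positions can be compared with the scan.
    public static void OutlineRectangles(string inputFile, string debugFile, int pageno, IEnumerable<Rectangle> rects)
    {
        PdfReader reader = new PdfReader(inputFile);
        try
        {
            using (FileStream fs = new FileStream(debugFile, FileMode.Create, FileAccess.Write, FileShare.None))
            using (PdfStamper stamper = new PdfStamper(reader, fs))
            {
                // Draw above the existing page content, i.e. on top of the scanned image.
                PdfContentByte canvas = stamper.GetOverContent(pageno);
                canvas.SaveState();
                canvas.SetColorStroke(BaseColor.RED);
                canvas.SetLineWidth(0.5f);
                foreach (Rectangle rect in rects)
                {
                    // Rectangles are in page coordinates: lower-left corner plus width and height.
                    canvas.Rectangle(rect.Left, rect.Bottom, rect.Width, rect.Height);
                    canvas.Stroke();
                }
                canvas.RestoreState();
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

If the red outlines sit to the right of the printed words in the resulting file, the OCR text layer is misplaced and the highlight code is merely following it.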
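A quicker check
If you simply search for "Title of the Invention" in Adobe Reader, you can spot the problem, too:
The whole page
Looking at the OCR information for the whole page, one sees at once that its quality is not very good. Thus, you will run into many problems when processing this document.
The whole scanned page
The OCR information of the whole page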