How to extract content from an appearance stream when Resources is not in the dictionary?

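I am trying to read the appearance stream of a PDF annotation with iTextSharp and extract the text content from that stream.

I am using the following code: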

public String ExtractAnnotationText(PdfStream xObject)
{
    // Resources are looked up only in the stream's own dictionary; this is null when /Resources is absent.
    PdfDictionary resources = xObject.GetAsDict(PdfName.RESOURCES);
    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(strategy);
    byte[] contentByteArray = ContentByteUtils.GetContentBytesFromContentObject(xObject);
    processor.ProcessContent(contentByteArray, resources);
    return strategy.GetResultantText();
}

The xObject is taken from the appearance dictionary and passed in like this:

PRStream value = (PRStream)appearancesDictionary.GetAsStream(key);
String text = ExtractAnnotationText(value);

This generally works for getting the appearance text of an annotation, but I have found a FreeTextCallout example where the xObject has no /Resources key, as its hashMap shows:

[/Type, /XObject]
[/Subtype, /Form]   
[/FormType, 1]
[/Length, 71]
[/Matrix, [1, 0, 0, 1, -28.7103, -643.893]]
[/BBox, [28.7103, 643.893, 597.85, 751.068]]
[/Filter, /FlateDecode]

Is there another way to construct a Resources dictionary to pass to PdfContentStreamProcessor.ProcessContent() in this case? Or even a different way of getting at the text without using ProcessContent()?

On this point, the PDF specification states:

A resource dictionary shall be associated with a content stream in one of the following ways:

  • For a content stream that is the value of a page’s Contents entry (or is an element of an array that is the value of that entry), the resource dictionary shall be designated by the page dictionary’s Resources or is inherited, as described under 7.7.3.4, "Inheritance of Page Attributes," from some ancestor node of the page object.

  • For other content streams, a conforming writer shall include a Resources entry in the stream's dictionary specifying the resource dictionary which contains all the resources used by that content stream. This shall apply to content streams that define form XObjects, patterns, Type 3 fonts, and annotation.

  • PDF files written obeying earlier versions of PDF may have omitted the Resources entry in all form XObjects and Type 3 fonts used on a page. All resources that are referenced from those forms and fonts shall be inherited from the resource dictionary of the page on which they are used. This construct is obsolete and should not be used by conforming writers.

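(ISO 32000-1, section 7.8.3, Resource Dictionaries)

Thus, the sample you found is either a case of the third option, or it simply does not need any resources, or your sample file is broken.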
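
If the third option applies, one possible workaround is to fall back to the resources of the page that carries the annotation. The following is only a minimal sketch under stated assumptions, not a tested solution: the class name AnnotationTextExtractor and the extra reader and pageNumber parameters are hypothetical additions to the method from the question, and the caller is assumed to know on which (1-based) page the annotation sits.

```csharp
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class AnnotationTextExtractor
{
    // Hypothetical variant of the question's method: reader and pageNumber
    // are assumed extras that identify the page holding the annotation.
    public static string ExtractAnnotationText(PdfStream xObject, PdfReader reader, int pageNumber)
    {
        // Preferred case: the form XObject carries its own /Resources entry.
        PdfDictionary resources = xObject.GetAsDict(PdfName.RESOURCES);

        // Obsolete-writer case: fall back to the resources of the page.
        if (resources == null)
        {
            PdfDictionary page = reader.GetPageN(pageNumber);
            resources = page.GetAsDict(PdfName.RESOURCES);
        }

        // Last resort: an empty dictionary, which is only sufficient
        // if the content stream references no named resources at all.
        if (resources == null)
        {
            resources = new PdfDictionary();
        }

        ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
        PdfContentStreamProcessor processor = new PdfContentStreamProcessor(strategy);
        byte[] contentBytes = ContentByteUtils.GetContentBytesFromContentObject(xObject);
        processor.ProcessContent(contentBytes, resources);
        return strategy.GetResultantText();
    }
}
```

A call would then look like AnnotationTextExtractor.ExtractAnnotationText(value, reader, pageNumber). If the stream references a font that is found in neither dictionary, the processor will presumably still fail, which would point to the broken-file case. If the stream contains no text operators at all, the result is simply an empty string.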