Getting exception while redacting PDF using iText

I am getting an exception while trying to redact a PDF document using iText. The issue is very sporadic: sometimes it works and sometimes it throws an error.

at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.access00(PdfContentStreamProcessor.java:60)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$Do.invoke(PdfContentStreamProcessor.java:991)
at com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpContentOperator.invoke(PdfCleanUpContentOperator.java:140)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:286)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:425)
at com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpProcessor.cleanUpPage(PdfCleanUpProcessor.java:160)
at com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpProcessor.cleanUp(PdfCleanUpProcessor.java:135)
at RedactionClass.tgestRedactJavishsInput(RedactionClass.java:56)
at RedactionClass.main(RedactionClass.java:23)

The code I am using to redact is below:

public static void testRedact() throws IOException, DocumentException {

    InputStream resource = new FileInputStream("D:/itext/edited_120192824_5 (1).pdf");
    OutputStream result = new FileOutputStream(new File(OUTPUTDIR,
            "aviteshs.pdf"));

    PdfReader reader = new PdfReader(resource);
    PdfStamper stamper = new PdfStamper(reader, result);
    int pageCount = reader.getNumberOfPages();
    Rectangle linkLocation1 = new Rectangle(440f, 700f, 470f, 710f);
    Rectangle linkLocation2 = new Rectangle(308f, 205f, 338f, 215f);
    Rectangle linkLocation3 = new Rectangle(90f, 155f, 130f, 165f);
    List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
    for (int currentPage = 1; currentPage <= pageCount; currentPage++) {
        if (currentPage == 1) {
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage,
                    linkLocation1, BaseColor.BLACK));
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage,
                    linkLocation2, BaseColor.BLACK));
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage,
                    linkLocation3, BaseColor.BLACK));
        } else {
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage,
                    linkLocation1, BaseColor.BLACK));
        }
    }
    PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations,
            stamper);
    try {
        cleaner.cleanUp();
    } catch (Exception e) {
        e.printStackTrace();
    }
    stamper.close();
    reader.close();

}

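I cannot share the original document because it is a customer document, so I am trying to put together similar test data.

Please find the document here: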

https://drive.google.com/file/d/0B-zalNTEeIOwM1JJVWctcW8ydU0/view?usp=drivesdk

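In short: the cause of the NullPointerException here is that iText does not support form XObjects inheriting resources from the page on which they are displayed. According to the PDF specification this construct is obsolete, but it can still be encountered in PDFs that follow the earlier PDF References rather than the specification.

The cause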

Page 1 of the document in question contains 4 XObject resources named I1, M0, P1, and Q0:

As the screenshot shows, Q0 has no Resources dictionary of its own. Yet its last instructions are

q
413 0 0 125 75 3086 cm
/I1 Do
Q

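That is, it references a resource named I1.

Now, in the case of form XObjects, iText assumes that the resources referenced by their content are contained in their own Resources dictionary.

The result: iText accesses a null dictionary and a NullPointerException occurs.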
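
To make the reference explicit, the following is a minimal sketch that lists the names used as operands of `Do` in a content stream. It assumes the content has already been decoded to plain text and splits on whitespace only; real content streams also contain strings, comments and inline images, which this toy scanner does not handle. The class and method names are made up for illustration and are not part of iText.

```java
import java.util.ArrayList;
import java.util.List;

final class DoOperandScanner {
    /** Returns the names (without the leading slash) that are operands of a Do operator. */
    static List<String> xObjectNamesUsed(String content) {
        List<String> names = new ArrayList<>();
        String previous = null;
        // Naive tokenization on whitespace; sufficient for simple streams like the one above.
        for (String token : content.trim().split("\\s+")) {
            if (token.equals("Do") && previous != null && previous.startsWith("/")) {
                names.add(previous.substring(1));
            }
            previous = token;
        }
        return names;
    }
}
```

For the four instructions quoted above, the scanner reports the single name I1, which then has to be resolved in some resource dictionary.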

The specification

The PDF specification ISO 32000-1 says:

A resource dictionary shall be associated with a content stream in one of the following ways:

  • For a content stream that is the value of a page’s Contents entry (or is an element of an array that is the value of that entry), the resource dictionary shall be designated by the page dictionary’s Resources or is inherited, as described under 7.7.3.4, "Inheritance of Page Attributes," from some ancestor node of the page object.

  • For other content streams, a conforming writer shall include a Resources entry in the stream's dictionary specifying the resource dictionary which contains all the resources used by that content stream. This shall apply to content streams that define form XObjects, patterns, Type 3 fonts, and annotation.

  • PDF files written obeying earlier versions of PDF may have omitted the Resources entry in all form XObjects and Type 3 fonts used on a page. All resources that are referenced from those forms and fonts shall be inherited from the resource dictionary of the page on which they are used. This construct is obsolete and should not be used by conforming writers.

(ISO 32000-1, section 7.8.3 - Resource Dictionaries)

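Thus, in the case at hand we are in the obsolete situation of option three: Q0 references the XObject I1, which is defined in the resource dictionary of the page on which Q0 is used.

The version header of the document in question claims conformance to PDF 1.5 (in contrast to the PDF 1.7 described by the PDF specification). So let's look at the PDF Reference 1.5. The paragraph corresponding to option three is: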

  • A form XObject or a Type 3 font’s glyph description may omit the Resources entry, in which case resources will be looked up in the Resources entry of the page on which the form or font is used. This practice is not recommended.
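
If you want to check which version your own files claim, the header can be read without any PDF library. The sketch below looks for `%PDF-` in the first bytes of a file and returns the digits that follow. The class and method names are hypothetical, and a Version entry in the document catalog, if present, can override the header, which this sketch ignores.

```java
import java.nio.charset.StandardCharsets;

final class PdfHeaderVersion {
    /** Returns e.g. "1.5" for data starting with "%PDF-1.5", or null if no header is found. */
    static String claimedVersion(byte[] firstBytes) {
        // Only inspect the beginning of the file; ISO-8859-1 maps every byte to one char.
        int length = Math.min(firstBytes.length, 1024);
        String head = new String(firstBytes, 0, length, StandardCharsets.ISO_8859_1);
        int start = head.indexOf("%PDF-");
        if (start < 0) {
            return null;
        }
        int pos = start + "%PDF-".length();
        int end = pos;
        while (end < head.length()
                && (Character.isDigit(head.charAt(end)) || head.charAt(end) == '.')) {
            end++;
        }
        return end > pos ? head.substring(pos, end) : null;
    }
}
```

Passing the first kilobyte of a file (for example via `Files.newInputStream` and a single `read`) is enough for this check.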

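Thus, all in all, the PDF in question uses a construct which the PDF specification (published in 2008, so in use for nine years!) calls obsolete, and which even the PDF Reference the file claims conformance to recommends against. iText, on the other hand, does not support this obsolete construct.

Ideas on how to fix this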

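Essentially the PDF cleanup code has to be extended to

  • remember the resources of the current page in the PdfCleanUpProcessor, and
  • use these current page resources in the PdfCleanUpContentOperator method invoke in case the Do operator refers to a form XObject without resources of its own.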
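
The two steps above can be sketched independently of iText. The following model is only an illustration of the lookup rule, not iText code: dictionaries are plain maps, and `ResourceFallbackSketch`, `Form`, `enterForm`, `leaveForm` and `lookup` are hypothetical names. The processor remembers the page resources when it is created and falls back to them whenever a form has no Resources entry of its own.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.Objects;

final class ResourceFallbackSketch {
    /** Hypothetical stand-in for a form XObject with an optional Resources dictionary. */
    static final class Form {
        final Map<String, Object> resources; // null when the Resources entry is omitted

        Form(Map<String, Object> resources) {
            this.resources = resources;
        }
    }

    // Step 1: remember the resources of the page currently being processed.
    private final Map<String, Object> pageResources;
    private final Deque<Map<String, Object>> resourceStack = new ArrayDeque<>();

    ResourceFallbackSketch(Map<String, Object> pageResources) {
        this.pageResources = Objects.requireNonNull(pageResources);
        resourceStack.push(pageResources);
    }

    /** Step 2: on Do, use the form's own resources or fall back to the page resources. */
    void enterForm(Form form) {
        resourceStack.push(form.resources != null ? form.resources : pageResources);
    }

    /** Restores the resources that were active before the form was entered. */
    void leaveForm() {
        resourceStack.pop();
    }

    /** Resolves a name in the resources that are currently in effect. */
    Object lookup(String name) {
        return resourceStack.peek().get(name);
    }
}
```

With a page dictionary that defines I1 and a form like Q0 that has no resources, `lookup("I1")` inside the form resolves against the page instead of dereferencing null.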

Unfortunately, some of the members used in invoke are private. Thus, one either has to copy the PdfCleanUp code or resort to reflection.
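
For the reflection route, the general mechanism looks like the sketch below. It uses a made-up `Operator` class with a made-up private field `state` as a stand-in; the actual private member names in the iText cleanup classes are not shown here and would have to be taken from the iText source of the version in use.

```java
import java.lang.reflect.Field;

final class PrivateMemberAccess {
    /** Hypothetical stand-in for a library class with a private member. */
    static final class Operator {
        private final String state = "internal";
    }

    /** Reads a private field declared directly in the target's class. */
    static Object readPrivateField(Object target, String fieldName) {
        try {
            // getDeclaredField does not search superclasses.
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true);
            return field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read field " + fieldName, e);
        }
    }
}
```

Code like this breaks silently when the library renames its members, so copying the cleanup classes may be the more maintainable option if you pin a specific iText version.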

(iText 5.5.12-SNAPSHOT)

iText 7

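The iText 7 PDF cleanup tool also runs into a problem with your PDF. Here the exception is an IllegalStateException claiming "Graphics state is always deleted after event dispatching. If you want to preserve it in renderer info, use preserveGraphicsState method after receiving renderer info."

Since this exception is thrown during event dispatching, the error message does not make sense. Unfortunately, the PDF cleanup tool has become closed source in iText 7, so pinpointing the issue is not easy.

(iText 7.0.3-SNAPSHOT; PDF cleanup 1.0.2-SNAPSHOT)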