How to save the pages of a PDF file as byte[] and restore them again

Question

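I need to parse a PDF file page by page and load each page separately into a byte[]. I use the iText library.

I load a file consisting of a single page with this code: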

   public Document addPageInTheDocument(String namePage, MultipartFile pdfData, Long documentId) throws IOException {
      notNull(namePage, INVALID_PARAMETRE);
      notNull(pdfData, INVALID_PARAMETRE);
      notNull(documentId, INVALID_PARAMETRE);
      byte[] in = pdfData.getBytes(); // size file 88747
      Page page = new Page(namePage);
      Document document = new Document();
      document.setId(documentId);
      PdfReader reader = new PdfReader(new ByteArrayInputStream(pdfData.getBytes()));
      PdfDocument pdfDocument = new PdfDocument(reader);
      if (pdfDocument.getNumberOfPages() != 1) {
          throw new IllegalArgumentException();
      }
      byte[] transform = pdfDocument.getPage(1).getContentBytes(); // 1907 size page
      page.setPageData(pdfDocument.getPage(1).getContentBytes());
      return addPageInTheDocument(document, page);
  }

I am trying to restore the file with this code:

ByteBuffer byteContent = new ByteBuffer() ;
    for (Map.Entry<String, Page> page : pages.entrySet()) {
       byteContent.append(page.getValue().getPageData());
    }
    PdfWriter writer = new PdfWriter(new FileOutputStream(book.getName() + modification + FORMAT));
    byte[] df = byteContent.toByteArray();
    PdfReader reader = new PdfReader(new ByteArrayInputStream(byteContent.toByteArray()));
    com.itextpdf.layout.Document itextDocument = new com.itextpdf.layout.Document(new PdfDocument(reader, writer));
    itextDocument.close();

Why is there such a difference in size? And why can't a valid file be created from the byte[] data of the pages?

Answer 1

Let's start with your size issue:

byte[] in = pdfData.getBytes(); // size file 88747
...
byte[] transform = pdfDocument.getPage(1).getContentBytes(); // 1907 size page

...

Why is there such a difference in size?

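Because PdfPage.getContentBytes() does not return what you expect.

You seem to expect it to return a complete representation of the contents of the given page, and the Javadoc of that method ("Get decoded bytes for the whole page content.") can probably be read that way.

This is not the case. PdfPage.getContentBytes() returns the contents of the content stream(s) of the page. These content streams contain a sequence of instructions that build the page. But these instructions take parameters that reference data outside the content stream, for example:

When text is drawn on a PDF page, the content stream contains an operation selecting a font, but the data describing the font and, for embedded fonts, the font program itself are outside the content stream;
When a bitmap image is drawn, the content stream usually contains an operation referencing image data outside the content stream;
Some operations reference so-called XObjects, which are essentially independent content streams that can be called from any page; these XObjects are not contained in the page content stream either.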
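
To make the point about external references concrete, here is a minimal sketch in plain Java (no iText needed). The sample content stream, the class name ContentStreamRefs and the method findResourceNames are made up for illustration, and the tokenizer is deliberately simplistic. It lists the names passed to the Tf (select font) and Do (paint XObject) operators. Those names are all the content stream holds; the font and image data they point to live in the page resources.

```java
import java.util.ArrayList;
import java.util.List;

/** Lists the resource names that a page content stream merely refers to. */
class ContentStreamRefs {

    /**
     * Returns the name operands of the Tf (select font) and Do (paint XObject)
     * operators. Simplified whitespace tokenizer: fine for the sample below,
     * not a parser for arbitrary PDF syntax.
     */
    static List<String> findResourceNames(String contentStream) {
        List<String> names = new ArrayList<>();
        String lastName = null; // most recent /Name operand seen
        for (String token : contentStream.trim().split("\\s+")) {
            if (token.startsWith("/")) {
                lastName = token.substring(1);
            } else if ((token.equals("Tf") || token.equals("Do")) && lastName != null) {
                names.add(lastName);
                lastName = null;
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical content stream: text set in font /F1, then image /Im1 painted.
        String sample = "BT /F1 12 Tf 72 720 Td (Hello) Tj ET "
                + "q 200 0 0 100 72 500 cm /Im1 Do Q";
        System.out.println(findResourceNames(sample)); // prints [F1, Im1]
    }
}
```

The bytes of the font program behind /F1 and the pixel data behind /Im1 appear nowhere in this string, which is why the content bytes are so much smaller than the file.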

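Furthermore, there are annotations (e.g. form fields) with content streams of their own, which are stored in separate structures. Many page properties also live outside the content stream.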

Thus, there is a difference in size because getContentBytes gives you only a small part of the page definition.

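Now let's look at your code for "restoring the file".

As a corollary of the above, it is clear that your code merely concatenates some content streams without providing the external resources these streams reference.

But beyond that, your code also points to a misunderstanding of the nature of PDF pages: they are not mere blobs that you can split and concatenate again at will. They are collections of PDF objects spread throughout the PDF file, and different pages can share some of their objects (e.g. fonts or commonly used images).

What you can do instead...

As the representation of a single page, you should use a PDF that contains the data referenced by that single page. The iText example Burst.java shows how to do that.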
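
As a rough sketch of that idea (not necessarily how the Burst.java sample does it), the following uses PdfDocument.copyPagesTo from iText 7 to produce one complete single-page PDF per page, held in a byte[]. The class name PageSplitter and the method splitIntoSinglePagePdfs are hypothetical names chosen for this illustration, and the iText 7 kernel module is assumed to be on the classpath.

```java
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a PDF into self-contained single-page PDFs. */
class PageSplitter {

    /** Returns one complete single-page PDF (as byte[]) per page of the source PDF. */
    static List<byte[]> splitIntoSinglePagePdfs(byte[] sourcePdf) throws IOException {
        List<byte[]> pages = new ArrayList<>();
        try (PdfDocument source =
                     new PdfDocument(new PdfReader(new ByteArrayInputStream(sourcePdf)))) {
            for (int i = 1; i <= source.getNumberOfPages(); i++) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                try (PdfDocument single = new PdfDocument(new PdfWriter(out))) {
                    // Copies page i together with the objects it references
                    // (fonts, images, XObjects), unlike getContentBytes().
                    source.copyPagesTo(i, i, single);
                }
                // The bytes are complete only after the target document is closed.
                pages.add(out.toByteArray());
            }
        }
        return pages;
    }
}
```

In the code from the question, each resulting byte[] could then be stored via page.setPageData(...) in place of the content bytes.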

To join these single-page PDFs again, you can use the iText PdfMerger. Remember to set smart mode (PdfWriter.setSmartMode(true)) to prevent resource duplication in the result.
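
A minimal sketch of that merge step, assuming each stored page is a complete single-page PDF held in a byte[]. The class name PageJoiner and the method joinSinglePagePdfs are hypothetical, and the iText 7 kernel module is assumed to be on the classpath.

```java
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.utils.PdfMerger;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;

/** Hypothetical helper: joins single-page PDFs back into one document. */
class PageJoiner {

    /** Merges the given PDFs, in list order, into a single PDF returned as byte[]. */
    static byte[] joinSinglePagePdfs(List<byte[]> pagePdfs) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PdfWriter writer = new PdfWriter(out);
        // Smart mode lets the writer reuse identical objects (e.g. a font shared
        // by several pages) instead of writing a copy for every page.
        writer.setSmartMode(true);
        try (PdfDocument result = new PdfDocument(writer)) {
            PdfMerger merger = new PdfMerger(result);
            for (byte[] pagePdf : pagePdfs) {
                try (PdfDocument part =
                             new PdfDocument(new PdfReader(new ByteArrayInputStream(pagePdf)))) {
                    merger.merge(part, 1, part.getNumberOfPages());
                }
            }
        }
        // The bytes are complete only after the result document is closed.
        return out.toByteArray();
    }
}
```

To write the result to disk as in the question, the returned bytes can be passed to a FileOutputStream, or the PdfWriter can be constructed on the file directly.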

java

itext

itext7