pyPdf PdfFileReader vs PdfFileWriter

Question

I have the following code:

import os
from pyPdf import PdfFileReader, PdfFileWriter

path = "C:/Real Python/Course materials/Chapter 12/Practice files"

input_file_name = os.path.join(path, "Pride and Prejudice.pdf")
input_file = PdfFileReader(file(input_file_name, "rb"))
output_PDF = PdfFileWriter()

for page_num in range(1, 4):
    output_PDF.addPage(input_file.getPage(page_num))

output_file_name = os.path.join(path, "Output/portion.pdf")
output_file = file(output_file_name, "wb")
output_PDF.write(output_file)
output_file.close()

Until now I was only reading from PDFs, and later I learned to write from a PDF to a txt file... but now this... Why are PdfFileReader and PdfFileWriter so different?

Can someone explain this? I would expect something like this:

import os
from pyPdf import PdfFileReader, PdfFileWriter

path = "C:/Real Python/Course materials/Chapter 12/Practice files"

input_file_name = os.path.join(path, "Pride and Prejudice.pdf")
input_file = PdfFileReader(file(input_file_name, "rb"))

output_file_name = os.path.join(path, "out Pride and Prejudice.pdf")
output_file = PdfFileWriter(file(output_file_name, "wb"))

for page_num in range(1,4):
    page = input_file.getPage(page_num)
    output_file.addPage(page_num)
    output_file.write(page)

Any help? Thanks.

EDIT 0: What does .addPage() do?

for page_num in range(1, 4):
    output_PDF.addPage(input_file.getPage(page_num))

Does it just create 3 blank pages?

EDIT 1: Can someone explain what happens in each of these:

1) output_PDF = PdfFileWriter()

2) output_PDF.addPage(input_file.getPage(page_num))

3) output_PDF.write(output_file)

The third one passes a JUST CREATED(!) object to output_PDF. Why?

Answer 1

This is most likely because PDFs aren't entirely linear: the "header" is actually located at the end of the file.

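If the file were written to disk every time you made a change, your computer would have to keep pushing that data out to disk. Instead, the module (probably) stores the information about the document in an object (the PdfFileWriter) and converts that data into an actual PDF file only when you ask it to.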
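The idea above can be sketched with a toy model. This is not pyPdf's actual implementation: `ToyWriter`, its `pages` list, and the simplistic output format are all assumptions made for illustration, and only the method names `addPage` and `write` mirror the ones in the question. It shows why the writer starts out empty and only receives a file object at the very end.

```python
import io


class ToyWriter(object):
    """Hypothetical stand-in for PdfFileWriter: collects pages in memory."""

    def __init__(self):
        # Nothing touches the disk here; we only prepare an empty page list.
        self.pages = []

    def addPage(self, page):
        # Store a reference to an existing page; no blank page is created.
        self.pages.append(page)

    def write(self, stream):
        # Only now is the whole document serialized, in a single pass.
        for number, page in enumerate(self.pages, start=1):
            stream.write(("page %d: %s\n" % (number, page)).encode("utf-8"))
        # Summary information goes last, once the page count is known.
        stream.write(("pages: %d\n" % len(self.pages)).encode("utf-8"))


# Pretend these are the pages of the input document (index 0 is page 1).
source_pages = ["cover", "chapter 1", "chapter 2", "chapter 3", "index"]

writer = ToyWriter()
for page_num in range(1, 4):
    writer.addPage(source_pages[page_num])

output = io.BytesIO()  # any writable binary file object would do
writer.write(output)
result = output.getvalue().decode("utf-8")
print(result)
```

In this sketch the file object is just a destination for the finished bytes, which is why it is handed to `write()` rather than to the constructor.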

Answer 2

The issue is basically the PDF cross-reference table.

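It's a somewhat tangled spaghetti monster of references to pages, fonts, objects, and other elements, all of which need to be linked together to allow random access.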

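Every time the file is updated, this table needs to be rebuilt. The file is created in memory first so that the rebuild only has to happen once, which also reduces the chance of destroying the file.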
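To make the cross-reference table less abstract, here is a rough sketch that builds a tiny PDF of blank pages by hand. The function name `build_minimal_pdf` is made up for this example, it has nothing to do with how pyPdf works internally, and the output is deliberately minimal, so a strict validator may complain about it. The point is only that the table stores the byte offset of every object, so it cannot be written until all objects are in place.

```python
def build_minimal_pdf(page_count):
    """Build a tiny PDF of blank pages in memory and return its bytes."""
    # Object 1 is the catalog, object 2 the page tree, objects 3.. the pages.
    kids = " ".join("%d 0 R" % (3 + i) for i in range(page_count))
    bodies = [
        "<< /Type /Catalog /Pages 2 0 R >>",
        "<< /Type /Pages /Kids [%s] /Count %d >>" % (kids, page_count),
    ]
    for _ in range(page_count):
        bodies.append("<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>")

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += ("%d 0 obj\n%s\nendobj\n" % (number, body)).encode("ascii")

    # The cross-reference table can only be written once every offset is known.
    xref_start = len(out)
    out += ("xref\n0 %d\n" % (len(bodies) + 1)).encode("ascii")
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += ("%010d 00000 n \n" % offset).encode("ascii")

    # The trailer sits at the very end and points back at the table.
    out += ("trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)).encode("ascii")
    out += ("startxref\n%d\n%%%%EOF\n" % xref_start).encode("ascii")
    return bytes(out)


pdf_bytes = build_minimal_pdf(3)

# A reader starts from the end: find "startxref", then jump to that offset.
marker = pdf_bytes.rfind(b"startxref")
xref_offset = int(pdf_bytes[marker:].split()[1])
print(pdf_bytes[xref_offset:xref_offset + 4])
```

Adding or removing a page shifts the offsets of everything after it, so the whole table has to be regenerated. Doing that once, at write time, is simpler than patching it after every change.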

output_PDF = PdfFileWriter()

This creates space in memory for your PDF to go into (the pages will be pulled from your old PDF).

output_PDF.addPage(input_file.getPage(page_num))

This adds a page from your input PDF (one of the pages you want) to the PDF being built in memory.

output_PDF.write(output_file)

Finally, this writes the object stored in memory to a file, building the header and the cross-reference table and linking everything together nicely.

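Edit: Presumably, passing the JUST CREATED file object to write() is what signals pyPdf to start building the appropriate tables and linking everything together.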

--

In response to why it's different with .txt and CSV:

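When you copy from a text or CSV file, there is no existing data structure that has to be understood and moved around to make sure things like formatting, image placement, and form data (input sections, etc.) are preserved and created correctly.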
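For contrast, here is a small sketch of the plain-text case. It uses in-memory `io.StringIO` objects in place of real files, and the sample lines are made up for illustration. Each line stands on its own, so it can be written out the moment it is read, with no table to rebuild afterwards.

```python
import io

# Stand-ins for an input text file and an output text file.
source = io.StringIO(u"line 1\nline 2\nline 3\nline 4\nline 5\n")
target = io.StringIO()

# Copy the lines at indexes 1 to 3, writing each one as soon as it is read.
for line_num, line in enumerate(source):
    if 1 <= line_num < 4:
        target.write(line)

copied = target.getvalue()
print(copied)
```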

python

pypdf