How to edit a PDF file, replacing its data?
I'm trying to rotate the pages of a PDF file and then replace the old pages with the rotated ones in the same PDF file.
I wrote the following code:
#!/usr/bin/python
import os
from pyPdf import PdfFileReader, PdfFileWriter
my_path = "/home/USER/Desktop/files/"
input_file_name = os.path.join(my_path, "myfile.pdf")
input_file = PdfFileReader(file(input_file_name, "rb"))
input_file.decrypt("MyPassword")
output_PDF = PdfFileWriter()
for num_page in range(0, input_file.getNumPages()):
    page = input_file.getPage(num_page)
    page.rotateClockwise(270)
    output_PDF.addPage(page)
#Trying to replace old data with new data in the original file, not
#create a new file and add the new data!
output_file_name = os.path.join(my_path, "myfile.pdf")
output_file = file(output_file_name, "wb")
output_PDF.write(output_file)
output_file.close()
The code above raises an error! I even tried using:
input_file = PdfFileReader(file(input_file_name, "r+b"))
But that didn't work either...
Replacing the line:
output_file_name = os.path.join(my_path, "myfile.pdf")
with:
output_file_name = os.path.join(my_path, "myfile2.pdf")
fixes everything, but that's not what I want...
Any help?
The error:
Traceback (most recent call last):
  File "12-5.py", line 22, in <module>
    output_PDF.write(output_file)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 264, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 339, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 315, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 339, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 315, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 324, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 339, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 315, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 324, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 345, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 649, in getObject
    retval = readObject(self.stream, self)
  File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 67, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 564, in readFromStream
    raise utils.PdfReadError, "Unable to find 'endstream' marker after stream."
pyPdf.utils.PdfReadError: Unable to find 'endstream' marker after stream.
I suspect the problem is that PyPDF is still reading from the file while it is being written to.
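The suspicion above can be illustrated without PyPDF at all. The following is a minimal Python 3 sketch using only the standard library; the scratch file and the variable names are made up for the demonstration and merely stand in for myfile.pdf and the stream that the reader object holds. It shows that opening a path with "wb" empties the file at once, so a handle opened earlier for reading finds nothing left:

```python
import os
import tempfile

# Hypothetical scratch file standing in for myfile.pdf.
fd, path = tempfile.mkstemp(suffix=".pdf")
os.write(fd, b"%PDF-1.4 pretend page data")
os.close(fd)

# A reader keeps this handle open and fetches data lazily, on demand.
# buffering=0 keeps the demo deterministic (nothing is cached by Python).
reader = open(path, "rb", buffering=0)
header = reader.read(8)  # only the first bytes are read up front

# Opening the same path with "wb" truncates the file immediately.
writer = open(path, "wb")
size_after_open = os.path.getsize(path)

# Whatever the reader still needed is gone by now.
rest = reader.read()

writer.close()
reader.close()
os.remove(path)

print(header, size_after_open, rest)
```

This would be consistent with the traceback, where the failure happens inside getObject while write is still resolving objects from the input stream.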
As you noticed, the proper fix is to write to a separate file and then replace the original file with the new one. Something like this:
output_file_name = os.path.join(my_path, "myfile-temporary.pdf")
output_file = file(output_file_name, "wb")
output_PDF.write(output_file)
output_file.close()
os.rename(output_file_name, input_file_name)
I wrote some code to simplify this: https://github.com/shazow/unstdlib.py/blob/master/unstdlib/standard/contextlib_.py#L14
from unstdlib.standard.contextlib_ import open_atomic
with open_atomic(input_file_name, "wb") as output_file:
    output_PDF.write(output_file)
This automatically creates a temporary file, writes to it, and then replaces the original file.
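For readers who would rather not add a dependency, the same idea can be sketched with the standard library alone. This is not the unstdlib implementation; atomic_write_sketch is a hypothetical name, and the code assumes Python 3.3+ (for os.replace), whereas the question's script is Python 2:

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def atomic_write_sketch(target_path, mode="wb"):
    """Hypothetical helper: write to a temp file, then move it over target_path."""
    directory = os.path.dirname(os.path.abspath(target_path))
    # Create the temp file next to the target so the final move
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, mode) as tmp_file:
            yield tmp_file
        # Reached only if the caller's block finished without an error.
        os.replace(tmp_path, target_path)
    except BaseException:
        # Leave the original untouched and discard the partial output.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

It would be used the same way as above, for example `with atomic_write_sketch(input_file_name) as output_file:` followed by `output_PDF.write(output_file)`. os.replace is chosen over os.rename because os.rename fails on Windows when the destination already exists.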
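Edit: I originally misread the question. Below is my earlier, incorrect answer, which may still be helpful to others.
Your code is fine and should work on "most" PDFs.
The problem you're running into is that PyPDF is incompatible with the specific PDF you're trying to use. This could be a bug in PyPDF, or the PDF may not be entirely valid.
There are two things you can try: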
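1. See whether PyPDF2 can read the file. Install PyPDF2 with pip install PyPDF2, replace import pyPdf … with import PyPDF2 …, and then re-run your script.
2. Re-encode your PDF with another program and see if that helps. For example, something like convert bad.pdf bad.ps; convert bad.ps maybe-good.pdf might fix it.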