Duplicating PDF with PyPDF2 gives blank pages
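I'm using PyPDF2 to alter a PDF document (adding bookmarks). So I need to read in the entire source PDF and write it out again, keeping as much of the data intact as possible. Merely writing each page into a new PDF object may not be enough to preserve the document metadata.
PdfFileWriter() does have a number of methods for copying an entire file: cloneDocumentFromReader, appendPagesFromReader and cloneReaderDocumentRoot. However, they all have problems.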
If I use cloneDocumentFromReader or appendPagesFromReader, I get a valid PDF file with the correct number of pages, but every page is blank.
If I use cloneReaderDocumentRoot, I get a minimal valid PDF file, but with no pages or data.
This has been asked before, but with no successful answers.
Other related questions have been asked, but I can't apply the answers given.
Here's my code:
from PyPDF2 import PdfFileReader, PdfFileWriter

def bookmark(incomingFile):
    reader = PdfFileReader(incomingFile)
    writer = PdfFileWriter()
    writer.appendPagesFromReader(reader)
    #writer.cloneDocumentFromReader(reader)
    # (title, zero-based page index) pairs
    my_table_of_contents = [
        ('Page 1', 0),
        ('Page 2', 1),
        ('Page 3', 2)
    ]
    # writer.addBookmark(title, pagenum, parent=None, color=None, bold=False, italic=False, fit='/Fit')
    for title, pagenum in my_table_of_contents:
        writer.addBookmark(title, pagenum, parent=None)
    writer.setPageMode("/UseOutlines")
    # Note: the result is written back over the input file
    with open(incomingFile, "wb") as fp:
        writer.write(fp)
I tend to get errors when PyPDF2 can't add the bookmark to the PdfFileWriter object, because it doesn't have any pages, or something similar.
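One way to narrow down errors like this is to compare the bookmark targets against the number of pages the writer actually holds before calling addBookmark. The following is only a minimal sketch: check_bookmarks is a hypothetical helper, not part of PyPDF2, and it assumes the page count comes from something like writer.getNumPages().

```python
def check_bookmarks(table_of_contents, num_pages):
    """Split (title, page_index) pairs into usable and out-of-range entries.

    num_pages is assumed to be the writer's page count, for example
    the value of writer.getNumPages() in PyPDF2 1.x.
    """
    usable, rejected = [], []
    for title, pagenum in table_of_contents:
        # Bookmark targets are zero-based page indices.
        if 0 <= pagenum < num_pages:
            usable.append((title, pagenum))
        else:
            rejected.append((title, pagenum))
    return usable, rejected


toc = [('Page 1', 0), ('Page 2', 1), ('Page 3', 2)]
usable, rejected = check_bookmarks(toc, num_pages=2)
print(usable)    # [('Page 1', 0), ('Page 2', 1)]
print(rejected)  # [('Page 3', 2)]
```

If rejected is non-empty, or num_pages is 0, the copy step failed before any bookmark was added, which separates the copying problem from the bookmarking problem.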
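I also struggled with this for quite a while and finally found that PyPDF2 has this issue.
Basically, I copied this answer's code into C:\ProgramData\Anaconda3\lib\site-packages\PyPDF2\pdf.py (the path will depend on your distribution), in the cloneDocumentFromReader function around line 382.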
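Since the path depends on the distribution, it may help to ask Python where the module lives. This is a small sketch using only the standard library; locate_module_file is a hypothetical helper name, and "PyPDF2.pdf" is assumed to be the module name that matches the pdf.py file mentioned above.

```python
import importlib.util


def locate_module_file(module_name):
    """Return the source file path for module_name, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ImportError:
        # Raised when the parent package of a dotted name is missing or fails to import.
        return None
    if spec is None:
        return None
    return spec.origin


# Prints the path of PyPDF2/pdf.py if that module exists in this environment, else None.
print(locate_module_file("PyPDF2.pdf"))
```

A result of None suggests that PyPDF2 is not installed in the active environment, or that the installed version no longer ships a pdf.py module.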
After that, I was able to use writer.cloneDocumentFromReader(pdf) to append the reader's pages to the writer and, in my case, update the PDF metadata (Subject, Keywords, etc.).
Hope this helps.
# Replacement for PdfFileWriter.cloneDocumentFromReader in PyPDF2/pdf.py
def cloneDocumentFromReader(self, reader, after_page_append=None):
    '''
    Create a copy (clone) of a document from a PDF file reader

    :param reader: PDF file reader instance from which the clone
        should be created.
    :callback after_page_append (function): Callback function that is invoked after
        each page is appended to the writer. Signature includes a reference to the
        appended page (delegates to appendPagesFromReader). Callback signature:

        :param writer_pageref (PDF page reference): Reference to the page just
            appended to the document.
    '''
    debug = False
    if debug:
        print("Number of Objects: %d" % len(self._objects))
        for obj in self._objects:
            print("\tObject is %r" % obj)
            if hasattr(obj, "indirectRef") and obj.indirectRef is not None:
                print("\t\tObject's reference is %r %r, at PDF %r" % (obj.indirectRef.idnum, obj.indirectRef.generation, obj.indirectRef.pdf))

    # Variables used after cloning the root to
    # improve the pre- and post-cloning experience
    mustAddTogether = False
    newInfoRef = self._info
    oldPagesRef = self._pages
    oldPages = self.getObject(self._pages)

    # If any pages have already been added
    if oldPages[NameObject("/Count")] > 0:
        # Keep them
        mustAddTogether = True
    else:
        # Throw the page object out
        if oldPages in self._objects:
            newInfoRef = self._pages
            self._objects.remove(oldPages)

    # Clone the reader's root document
    self.cloneReaderDocumentRoot(reader)
    if not self._root:
        self._root = self._addObject(self._root_object)

    # Sweep for all indirect references
    externalReferenceMap = {}
    self.stack = []
    newRootRef = self._sweepIndirectReferences(externalReferenceMap, self._root)

    # Delete the stack to reset
    del self.stack

    # Clean-up time!
    # Get the new root of the PDF
    realRoot = self.getObject(newRootRef)

    # Get the new pages tree root and its ID number
    tmpPages = realRoot[NameObject("/Pages")]
    newIdNumForPages = 1 + self._objects.index(tmpPages)

    # Make an IndirectObject just for the new Pages
    self._pages = IndirectObject(newIdNumForPages, 0, self)

    # If there are any pages to add back in
    if mustAddTogether:
        # Set the new pages root's parent to the old
        # pages root's reference
        tmpPages[NameObject("/Parent")] = oldPagesRef

        # Add the reference to the new pages root to
        # the old pages root's kids array
        newPagesRef = self._pages
        oldPages[NameObject("/Kids")].append(newPagesRef)

        # Point all references at the old pages root,
        # which now contains the new one
        self._pages = oldPagesRef
        realRoot[NameObject("/Pages")] = oldPagesRef

        # Update the count attribute of the pages root
        oldPages[NameObject("/Count")] = NumberObject(oldPages[NameObject("/Count")] + tmpPages[NameObject("/Count")])
    else:
        # Bump up the info's reference because the old
        # pages tree was bumped off
        self._info = newInfoRef
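To make the mustAddTogether branch easier to follow, here is a toy sketch of the same page-tree merge using plain dicts and strings. It is an illustration only: merge_page_trees and the "old"/"new" reference strings are hypothetical stand-ins for PyPDF2's DictionaryObject and IndirectObject, not PyPDF2 code.

```python
def merge_page_trees(old_pages, new_pages, old_ref="old", new_ref="new"):
    """Nest new_pages under old_pages, mirroring the mustAddTogether branch."""
    # The cloned tree's root becomes a child of the existing root.
    new_pages["/Parent"] = old_ref
    old_pages["/Kids"].append(new_ref)
    # The existing root now counts its own pages plus the cloned ones.
    old_pages["/Count"] = old_pages["/Count"] + new_pages["/Count"]
    return old_pages


old = {"/Type": "/Pages", "/Kids": ["page-a"], "/Count": 1}
new = {"/Type": "/Pages", "/Kids": ["page-b", "page-c"], "/Count": 2}
merged = merge_page_trees(old, new)
print(merged["/Count"], merged["/Kids"])  # 3 ['page-a', 'new']
```

The cloned pages are not copied into the old /Kids array one by one; the whole cloned tree is attached as a single child node, so only /Count on the parent has to change.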