使用 Python3 和 pdfrw 编辑 PDF 元数据字段
Editing PDF metadata fields with Python3 and pdfrw
我正在尝试编辑 PDF 的元数据 Title
字段,以尽可能包含 ASCII 等效项。我正在使用 Python3 和模块 pdfrw
.
如何执行替换元数据字段的字符串操作?
我的测试代码在这里:
from pdfrw import PdfReader, PdfWriter, PdfString
import unicodedata
def edit_title_metadata(inpdf):
trailer = PdfReader(inpdf)
# this statement is breaking pdfrw
trailer.Info.Title = unicode_normalize(trailer.Info.Title)
# also have tried:
#trailer.Info.Title = PdfString(unicode_normalize(trailer.Info.Title))
PdfWriter("test.pdf", trailer=trailer).write()
return
def unicode_normalize(s):
return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
if __name__ == "__main__":
edit_title_metadata('Anadon-2011-Scientific Opinion on the safety e.pdf')
追溯是:
Traceback (most recent call last):
File "get_metadata.py", line 68, in <module>
main()
File "get_metadata.py", line 54, in main
edit_title_metadata(pdf)
File "get_metadata.py", line 11, in edit_title_metadata
trailer.Info.Title = PdfString(unicode_normalize(trailer.Info.Title))
File "get_metadata.py", line 18, in unicode_normalize
return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
File "/path_to_python/python3.7/site-packages/pdfrw/objects/pdfstring.py", line 550, in encode
if isinstance(source, uni_type):
TypeError: isinstance() arg 2 must be a type or tuple of types
备注:
This issue 在 GitHub 可能是相关的。
FWIW,也遇到与 Python3.6
相同的错误
我分享了 pdf(有非 ascii 连字符,unicode 字符 \u2010)
.
wget https://gist.github.com/philshem/71507d4e8ecfabad252fbdf4d9f8bdd2/raw/cce346ab39dd6ecb3a718ad3f92c9f546761e87b/Anadon-2011-Scientific%2520Opinion%2520on%2520the%2520safety%2520e.pdf
您必须在元数据字段上使用 .decode()
方法:
trailer.Info.Title = unicode_normalize(trailer.Info.Title.decode())
以及完整的工作代码:
from pdfrw import PdfReader, PdfWriter, PdfReader
import unicodedata
def edit_title_metadata(inpdf):
trailer = PdfReader(inpdf)
trailer.Info.Title = unicode_normalize(trailer.Info.Title.decode())
PdfWriter("test.pdf", trailer=trailer).write()
return
def unicode_normalize(s):
return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
if __name__ == "__main__":
edit_title_metadata('Anadon-2011-Scientific Opinion on the safety e.pdf')
我正在尝试编辑 PDF 的元数据 Title
字段,以尽可能包含 ASCII 等效项。我正在使用 Python3 和模块 pdfrw
.
如何执行替换元数据字段的字符串操作?
我的测试代码在这里:
from pdfrw import PdfReader, PdfWriter, PdfString
import unicodedata
def edit_title_metadata(inpdf):
trailer = PdfReader(inpdf)
# this statement is breaking pdfrw
trailer.Info.Title = unicode_normalize(trailer.Info.Title)
# also have tried:
#trailer.Info.Title = PdfString(unicode_normalize(trailer.Info.Title))
PdfWriter("test.pdf", trailer=trailer).write()
return
def unicode_normalize(s):
return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
if __name__ == "__main__":
edit_title_metadata('Anadon-2011-Scientific Opinion on the safety e.pdf')
追溯是:
Traceback (most recent call last):
File "get_metadata.py", line 68, in <module>
main()
File "get_metadata.py", line 54, in main
edit_title_metadata(pdf)
File "get_metadata.py", line 11, in edit_title_metadata
trailer.Info.Title = PdfString(unicode_normalize(trailer.Info.Title))
File "get_metadata.py", line 18, in unicode_normalize
return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
File "/path_to_python/python3.7/site-packages/pdfrw/objects/pdfstring.py", line 550, in encode
if isinstance(source, uni_type):
TypeError: isinstance() arg 2 must be a type or tuple of types
备注:
This issue 在 GitHub 可能是相关的。
FWIW,也遇到与 Python3.6
相同的错误
我分享了 pdf(有非 ascii 连字符,unicode 字符 \u2010)
.
wget https://gist.github.com/philshem/71507d4e8ecfabad252fbdf4d9f8bdd2/raw/cce346ab39dd6ecb3a718ad3f92c9f546761e87b/Anadon-2011-Scientific%2520Opinion%2520on%2520the%2520safety%2520e.pdf
您必须在元数据字段上使用 .decode()
方法:
trailer.Info.Title = unicode_normalize(trailer.Info.Title.decode())
以及完整的工作代码:
from pdfrw import PdfReader, PdfWriter, PdfReader
import unicodedata
def edit_title_metadata(inpdf):
trailer = PdfReader(inpdf)
trailer.Info.Title = unicode_normalize(trailer.Info.Title.decode())
PdfWriter("test.pdf", trailer=trailer).write()
return
def unicode_normalize(s):
return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
if __name__ == "__main__":
edit_title_metadata('Anadon-2011-Scientific Opinion on the safety e.pdf')