Extracting the keywords from PDF metadata in Python

Question

I have a PDF file and want to get some information from its metadata. To do so, I use the following procedure:

from PyPDF2 import PdfFileReader    
mypath = "your_pdf_file.pdf"
pdf_toread = PdfFileReader(open(mypath, 'rb'))
pdf_info = pdf_toread.getDocumentInfo()

For the document at hand, the output is:

Out[230]: 
{'/CrossmarkDomainExclusive': 'true',
 '/CreationDate': "D:20181029074117+05'30'",
 '/CrossMarkDomains#5B2#5D': 'elsevier.com',
 '/Author': 'Nicola Gennaioli',
 '/Creator': 'Elsevier',
 '/ElsevierWebPDFSpecifications': '6.5',
 '/Subject': 'Journal of Monetary Economics, 98 (2018) 98-113. doi:10.1016/j.jmoneco.2018.04.011',
 '/CrossmarkMajorVersionDate': '2010-04-23',
 '/CrossMarkDomains#5B1#5D': 'sciencedirect.com',
 '/robots': 'noindex',
 '/ModDate': "D:20181029074135+05'30'",
 '/AuthoritativeDomain#5B1#5D': 'sciencedirect.com',
 '/Keywords': 'Sovereign Risk; Sovereign Default; Government Bonds',
 '/doi': '10.1016/j.jmoneco.2018.04.011',
 '/Title': 'Banks, government Bonds, and Default: What do the data Say?',
 '/AuthoritativeDomain#5B2#5D': 'elsevier.com',
 '/Producer': 'Acrobat Distiller 10.1.10 (Windows)'}

However, I found that the PyPDF2 library has no attribute for accessing the /Keywords part of this information, that is, this piece of the output:

'/Keywords': 'Sovereign Risk; Sovereign Default; Government Bonds',

So I need some help on how to get this piece of the metadata output [in this example: Sovereign Risk; Sovereign Default; Government Bonds].

To reproduce the output, I am sharing a link to the document.

Update, to give an example:

print(pdf_info.title)
Banks, government Bonds, and Default: What do the data Say?

print(pdf_info.subject)
Journal of Monetary Economics, 98 (2018) 98-113. doi:10.1016/j.jmoneco.2018.04.011

But when I try to do something similar for the /Keywords part, I get the following AttributeError:

pdf_info.keywords
Traceback (most recent call last):

  File "<ipython-input-295-3852401ef983>", line 1, in <module>
    pdf_info.keywords

AttributeError: 'DocumentInformation' object has no attribute 'keywords'

Answer 1

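The key /Keywords is actually present in the dictionary returned by getDocumentInfo, so you don't need to do anything special (other than first testing whether it exists, or wrapping the lookup in a try, in case it is absent from another file):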

from PyPDF2 import PdfFileReader    
mypath = "../Downloads/banks_gov_bonds_default.pdf"
pdf_toread = PdfFileReader(open(mypath, 'rb'))
pdf_info = pdf_toread.getDocumentInfo()
if '/Keywords' in pdf_info:
    print(pdf_info['/Keywords'])

This prints

Sovereign Risk; Sovereign Default; Government Bonds

which are indeed the keywords stored in that field of the sample PDF.
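Since the value comes back as a single semicolon-separated string, a small helper can turn it into a list and tolerate files that have no /Keywords entry at all. This is only a sketch: `extract_keywords` is a hypothetical helper name, not part of PyPDF2, and the `;` separator is an assumption based on this particular file (other producers may use commas). It works on any dict-like object, so a plain dict stands in for the metadata here.

```python
def extract_keywords(info, key="/Keywords", sep=";"):
    """Return the keywords stored under `key` as a list of stripped strings.

    `info` is any dict-like metadata object (or None). A missing or empty
    entry yields an empty list instead of raising an error.
    """
    raw = info.get(key) if info is not None else None
    if not raw:
        return []
    # Split on the separator and drop surrounding whitespace and empty parts.
    return [part.strip() for part in str(raw).split(sep) if part.strip()]


# Stand-in for the dictionary shown in the question (trimmed to two entries).
sample_info = {
    "/Author": "Nicola Gennaioli",
    "/Keywords": "Sovereign Risk; Sovereign Default; Government Bonds",
}

print(extract_keywords(sample_info))
print(extract_keywords({}))
```

With the real library you would pass `pdf_info` in place of `sample_info`.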

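Another option is to add keywords to the exposed PDF properties by editing pdf.py in the PyPDF2 folder where pip put it. You can find the creation of title, creator, author and more properties in class DocumentInformation, around line 2781 in my version. They are all created following a simple scheme, so adding one more is no problem at all: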

keywords = property(lambda self: self.getText("/Keywords"))
"""Read-only property accessing the document's **keywords**.
Returns a unicode string (``TextStringObject``)
or ``None`` if the keywords are not specified."""
keywords_raw = property(lambda self: self.get("/Keywords"))
"""The "raw" version of keywords; can return a ``ByteStringObject``."""

(I added keywords_raw only because the other properties do the same. I can't say offhand what they are for, though.)

After that, your code actually works:

from PyPDF2 import PdfFileReader    
mypath = "../Downloads/banks_gov_bonds_default.pdf"
pdf_toread = PdfFileReader(open(mypath, 'rb'))
pdf_info = pdf_toread.getDocumentInfo()
print(pdf_info.keywords)

The result, again:

Sovereign Risk; Sovereign Default; Government Bonds
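An edit made inside the installed package is likely to be overwritten the next time the package is upgraded or reinstalled. A less invasive variant of the same idea is to attach the property at runtime from your own script. The sketch below only demonstrates the mechanism on a hypothetical stand-in class, `StandInDocumentInfo`, which is not part of PyPDF2. It assumes the real DocumentInformation behaves like a dict with a getText method, as the property definition above suggests.

```python
class StandInDocumentInfo(dict):
    """Hypothetical stand-in for a dict-like metadata class with getText()."""

    def getText(self, key):
        # Return the value only if it is a text string, otherwise None.
        value = self.get(key)
        return value if isinstance(value, str) else None


# Attach the property to the class at runtime; no library file is edited.
StandInDocumentInfo.keywords = property(
    lambda self: self.getText("/Keywords")
)

info = StandInDocumentInfo(
    {"/Keywords": "Sovereign Risk; Sovereign Default; Government Bonds"}
)
print(info.keywords)

# A document without the entry yields None rather than an AttributeError.
print(StandInDocumentInfo().keywords)
```

With PyPDF2 itself, the assignment would presumably target the real DocumentInformation class instead of the stand-in, placed once before any metadata is read.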
