Python Data Extraction from an Encrypted PDF

Question

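I am a recent graduate in pure mathematics and have only taken a few basic programming courses. I am doing an internship and I have an internal data analysis project. I have to analyze the internal PDFs from the past few years. The PDFs are "secured." In other words, they are encrypted. We do not have the PDF passwords, and we are not even sure that passwords exist. However, we have all these documents and we can read them manually. We can also print them. The goal is to read them with Python, because it is the language we have some idea about.

First, I tried to read the PDFs with some Python libraries. However, the Python libraries that I found cannot read encrypted PDFs. At that time, I could not export the information using Adobe Reader either.

Second, I decided to decrypt the PDFs. I was successful using the Python library pikepdf. pikepdf works very well! However, the decrypted PDFs cannot be read with the Python libraries from the previous point either (PyPDF2 and Tabula). At this point we have made some progress, because with Adobe Reader I can export the information from the decrypted PDFs, but the goal is to do everything with Python.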

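The code that I am showing works perfectly with unencrypted PDFs, but not with encrypted PDFs. It does not work with the decrypted PDFs obtained with pikepdf either.

I did not write the code. I found it in the documentation of the Python libraries pikepdf and Tabula. The PyPDF2 solution was written by Al Sweigart in his book "Automate the Boring Stuff with Python", which I highly recommend. I also checked that the code works, with the limitations that I explained before.

First question: why can't I read the decrypted files, if the programs work with files that have never been encrypted?

Second question: can we read the decrypted files with Python somehow? Which library can do it, or is it impossible? Can data be extracted from all decrypted PDFs?

Thank you for your time and help!

I got these results using Python 3.7, Windows 10, Jupyter Notebook, and Anaconda 2019.07.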

import pikepdf

with pikepdf.open("encrypted.pdf") as pdf:
    num_pages = len(pdf.pages)
    del pdf.pages[-1]
    pdf.save("decrypted.pdf")

import tabula

tabula.read_pdf("decrypted.pdf", stream=True)

import PyPDF2

pdfFileObj = open("decrypted.pdf", "rb")
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages
pageObj = pdfReader.getPage(0)
pageObj.extractText()

With Tabula, I get the message "the output file is empty."

With PyPDF2, I only get '\n'

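Update 10/3/2019: Pdfminer.six (November 2018 release)

I got better results using the solution posted by DuckPuncher. For the decrypted file I got the labels, but not the data. The same happens with the encrypted file. For the file that has never been encrypted it works perfectly. Since I need both the data and the labels of encrypted or decrypted files, this code does not work for me. For this analysis I used pdfminer.six, a Python library released in November 2018. pdfminer.six includes the library pycryptodome. According to its documentation, "PyCryptodome is a self-contained Python package of low-level cryptographic primitives.."

The code is in the Stack Overflow question: Extracting text from a PDF file using PDFMiner in python?

If you would like to repeat my experiment, I would be glad. The instructions are as follows: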

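1) Run the code mentioned in this question with any PDF that has never been encrypted.

2) Do the same with a PDF that is "Secure" (this is the term Adobe uses); I am calling it the encrypted PDF. Use a generic form that you can find using Google. After you download it, you need to fill in the fields. Otherwise, you would be checking the labels, not the fields. The data is in the fields.

3) Decrypt the encrypted PDF using pikepdf. This will be the decrypted PDF.

4) Run the code again using the decrypted PDF.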
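Because step 2 notes that the data is in the fields, one thing worth checking is whether the values can be read from the form itself rather than from the page text. In a PDF form, field values are normally stored in the document's AcroForm dictionary (`/T` is the field name, `/V` the value, `/Kids` the child fields), and a page-text extractor may never see them. The following is only a minimal sketch under that assumption: `collect_fields` and `read_form_fields` are hypothetical helper names, and the pikepdf calls are assumed to behave as in current releases.

```python
def collect_fields(fields, parent_name=""):
    """Flatten a form-field tree into {qualified_name: value_as_text}.

    Works on anything with a dict-like .get(), which includes plain dicts
    and (by assumption) pikepdf dictionary objects.
    """
    values = {}
    for field in fields:
        partial = field.get("/T")
        if partial is None:
            name = parent_name  # widget-only child: inherits the parent's name
        elif parent_name:
            name = f"{parent_name}.{partial}"
        else:
            name = str(partial)

        value = field.get("/V")
        if value is not None:
            values[name] = str(value)

        kids = field.get("/Kids")
        if kids is not None:
            values.update(collect_fields(kids, name))
    return values


def read_form_fields(path, password=""):
    """Open a PDF with pikepdf and return its form-field values, if any."""
    import pikepdf  # third-party; imported lazily so collect_fields has no dependencies

    with pikepdf.open(path, password=password) as pdf:
        acroform = pdf.Root.get("/AcroForm")
        if acroform is None:
            return {}  # the document has no interactive form
        return collect_fields(acroform.get("/Fields", []))
```

If `read_form_fields("decrypted.pdf")` returns the filled-in values, the labels-without-data result would be explained by where the data is stored rather than by the encryption.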

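Update 10/4/2019: Camelot (July 2019 release)

I found the Python library Camelot. Be careful: you need camelot-py 0.7.3.

It is very powerful and works with Python 3.7. It is also very easy to use. First, you also need to install Ghostscript; otherwise, it will not work. You also need to install Pandas. Do not use pip install camelot-py. Instead use pip install camelot-py[cv]

The author of the program is Vinayak Mehta. Frank Du shares this code in a YouTube video, "Extract tabular data from PDF with Camelot Using Python."

I checked the code and it works with unencrypted files. However, it does not work with encrypted and decrypted files, and that is my goal.

Camelot is oriented toward getting tables from PDFs.

The code is as follows: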

import camelot
import pandas

name_table = camelot.read_pdf("uncrypted.pdf")
type(name_table)

# Each element of the result is a camelot Table
name_table[0]

first_table = name_table[0]

# Translate the camelot table object to a pandas dataframe
first_table.df

first_table.to_excel("unencrypted.xlsx")
# This creates an Excel file.
# The same can be done with csv, json, html, or sqlite.

# To get all the tables of the PDF you need to use this code.
for table in name_table:
    print(table.df)

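Update 10/7/2019: I found a trick. If I open the secured PDF with Adobe Reader, print it using Microsoft Print to PDF, and save it as a PDF, I can extract the data using that copy. I can also convert the PDF file to JSON, Excel, SQLite, CSV, HTML, and other formats. This is a possible solution to my question. However, I am still looking for an option that does not need that trick, because the goal is to do it 100% with Python. I am also concerned that the trick may not work if a better method of encryption is used. Sometimes you need to use Adobe Reader several times to get an extractable copy.

Update 10/8/2019. Third question. I now have a third question. Are all secured/encrypted PDFs password protected? Why is pikepdf not working? My guess is that the current version of pikepdf can break some types of encryption, but not all of them. @constt mentioned that PyPDF2 can break some types of protection. However, I replied that I found an article saying that PyPDF2 can break encryption made with Adobe Acrobat Pro 6.0, but not with later versions.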

Answer 1

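Last updated 10-11-2019

I'm not sure that I understand your question completely. The code below can be refined, but it reads in either an encrypted or an unencrypted PDF and extracts the text. Please let me know if I have misunderstood your requirements.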

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def extract_encrypted_pdf_text(path, encryption_true, decryption_password):
    output = StringIO()

    resource_manager = PDFResourceManager()
    laparams = LAParams()

    device = TextConverter(resource_manager, output, codec='utf-8', laparams=laparams)
    interpreter = PDFPageInterpreter(resource_manager, device)

    # An empty set of page numbers means "process every page".
    page_numbers = set()

    # Only pass the password when the file is encrypted.
    password = decryption_password if encryption_true else ''

    with open(path, 'rb') as pdf_infile:
        for page in PDFPage.get_pages(pdf_infile, page_numbers, maxpages=0, password=password, caching=True, check_extractable=True):
            interpreter.process_page(page)

    text = output.getvalue()
    device.close()
    output.close()
    return text

results = extract_encrypted_pdf_text('encrypted.pdf', True, 'password')
print(results)

I noted that the pikepdf code you used to open the encrypted PDF was missing a password, which should have thrown this error message:

pikepdf._qpdf.PasswordError: encrypted.pdf: invalid password

import pikepdf

with pikepdf.open("encrypted.pdf", password='password') as pdf:
    num_pages = len(pdf.pages)
    del pdf.pages[-1]
    pdf.save("decrypted.pdf")

You can use tika to extract the text from the decrypted.pdf created by pikepdf.

from tika import parser

parsedPDF = parser.from_file("decrypted.pdf")
pdf = parsedPDF["content"]
pdf = pdf.replace('\n\n', '\n')

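Additionally, pikepdf does not currently implement text extraction. This includes the latest release, v1.6.4.

I decided to run a couple of tests using various encrypted PDF files.

I named all the encrypted files 'encrypted.pdf', and they all used the same encryption and decryption password.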

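Adobe Acrobat 9.0 and later - encryption level 256-bit AES
- pikepdf was able to decrypt this file
- PyPDF2 could not extract the text correctly
- tika could extract the text correctly
Adobe Acrobat 6.0 and later - encryption level 128-bit RC4
- pikepdf was able to decrypt this file
- PyPDF2 could not extract the text correctly
- tika could extract the text correctly
Adobe Acrobat 3.0 and later - encryption level 40-bit RC4
- pikepdf was able to decrypt this file
- PyPDF2 could not extract the text correctly
- tika could extract the text correctly
Adobe Acrobat 5.0 and later - encryption level 128-bit RC4
- created with Microsoft Word
- pikepdf was able to decrypt this file
- PyPDF2 could extract the text correctly
- tika could extract the text correctly
Adobe Acrobat 9.0 and later - encryption level 256-bit AES
- created using pdfprotectfree
- pikepdf was able to decrypt this file
- PyPDF2 could extract the text correctly
- tika could extract the text correctly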

PyPDF2 was able to extract text from decrypted PDF files not created with Adobe Acrobat.

I would assume that the failures have something to do with embedded formatting in the PDFs created by Adobe Acrobat. More testing is required to confirm this conjecture about the formatting.

tika was able to extract text from all the documents decrypted with pikepdf.
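A comparison like this can be scripted so that it is easy to repeat on new files. The sketch below is illustrative only: `compare_extractors`, `tika_text`, and `pypdf2_text` are hypothetical helper names, and the two wrappers assume the same tika and legacy PyPDF2 (pre-3.0) calls used elsewhere in this answer.

```python
def compare_extractors(path, extractors):
    """Run each extractor on path and record 'ok', 'empty', or the error type."""
    report = {}
    for name, extract in extractors.items():
        try:
            text = extract(path) or ""
        except Exception as exc:  # any library failure is recorded, not raised
            report[name] = f"error: {type(exc).__name__}"
            continue
        # Whitespace-only output counts as a failed extraction.
        report[name] = "ok" if text.strip() else "empty"
    return report


def tika_text(path):
    from tika import parser  # third-party; needs a Java runtime
    return parser.from_file(path)["content"]


def pypdf2_text(path):
    from PyPDF2 import PdfFileReader  # legacy PyPDF2 API, as used in this answer
    with open(path, "rb") as f:
        reader = PdfFileReader(f)
        return "".join(reader.getPage(i).extractText() for i in range(reader.numPages))
```

Calling `compare_extractors("decrypted.pdf", {"tika": tika_text, "PyPDF2": pypdf2_text})` would return one status per library, which makes it easier to see whether a failure follows the file or the extractor.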

import pikepdf

with pikepdf.open("encrypted.pdf", password='password') as pdf:
    num_pages = len(pdf.pages)
    del pdf.pages[-1]
    pdf.save("decrypted.pdf")


from PyPDF2 import PdfFileReader

def text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        # Page indexes are zero-based, so this reads the second page.
        page = pdf.getPage(1)
        print('Page type: {}'.format(str(type(page))))
        text = page.extractText()
        print(text)

text_extractor('decrypted.pdf')

PyPDF2 cannot decrypt Acrobat PDF files >= 6.0

This issue has been open with the module owners since September 15, 2015. It is unclear from the comments on this issue when the project owners will fix the problem. The last commit was June 25, 2018.

PyPDF4 decryption issues

PyPDF4 is the replacement for PyPDF2. This module also has decryption issues with certain algorithms used to encrypt PDF files.

test file: Adobe Acrobat 9.0 and later - encryption level 256-bit AES

PyPDF2 error message: only algorithm code 1 and 2 are supported

PyPDF4 error message: only algorithm code 1 and 2 are supported. This PDF uses code 5
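The algorithm codes in these error messages appear to correspond to the `/V` entry of the PDF encryption dictionary. The lookup below is a small sketch based on the code meanings in the PDF specification; the function names are hypothetical, and the "supported" check only restates the error message above rather than anything read from the PyPDF2 source.

```python
# Meaning of the /V (algorithm) code in a PDF encryption dictionary.
V_ALGORITHMS = {
    1: "RC4 with a 40-bit key",
    2: "RC4 with a key longer than 40 bits",
    4: "crypt filters (RC4 or AES-128)",
    5: "AES-256",
}


def describe_algorithm(v):
    """Return a readable description for an algorithm code."""
    return V_ALGORITHMS.get(v, "unknown or unpublished algorithm")


def handled_by_pypdf2(v):
    """Per the error message, only algorithm codes 1 and 2 are supported."""
    return v in (1, 2)


for code in (1, 2, 4, 5):
    print(code, describe_algorithm(code), handled_by_pypdf2(code))
```

This is consistent with the test file: 256-bit AES uses code 5, which falls outside the two codes the modules accept.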

Updated section 10-11-2019

This section is in response to your updates on 10-07-2019 and 10-08-2019.

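In your update you stated that you could open a 'secured pdf with Adobe Reader' and print the document to another PDF, which removes the 'SECURED' flag. After doing some testing, I believe that I have figured out what is occurring in this scenario.

Adobe PDF security levels

Adobe PDF has multiple types of security controls that can be enabled by the owner of a document. The controls can be enforced with either a password or a certificate.

Document encryption (enforced with a document open password)
- Encrypt all document contents (most common)
- Encrypt all document contents except metadata (Acrobat 6.0 and later)
- Encrypt only file attachments (Acrobat 7.0 and later)
Restrictive editing and printing (enforced with a permissions password)
- Printing allowed
- Changes allowed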
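The image below shows an Adobe PDF encrypted with 256-bit AES encryption. To open or print this PDF, a password is required. When you open this document in Adobe Reader with the password, the title will state SECURED.

This document requires a password to open with the Python modules mentioned in this answer. If you attempt to open an encrypted PDF with Adobe Reader, you should see this:

If you don't get this warning, then the document either has no security controls enabled or has only the restrictive editing and printing controls enabled.

The image below shows restrictive editing enabled with a password in a PDF document. Note that printing is enabled. To open or print this PDF, a password is not required. When you open this document in Adobe Reader without a password, the title will state SECURED. This is the same warning as for the encrypted PDF that was opened with the password.

When you print this document to a new PDF, the SECURED warning is removed, because the restrictive editing has been removed.

All Adobe products enforce the restrictions set by the permissions password. However, if third-party products do not support these settings, document recipients are able to bypass some or all of the restrictions set.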

So I assume that the document that you are printing to PDF has restrictive editing enabled and does not have a password required to open enabled.
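One way to test that assumption from Python is to try opening the file without a password and see which of the three cases applies. This is only a sketch: `describe_security` and `inspect_pdf` are hypothetical names, and it assumes that the installed pikepdf release exposes an `is_encrypted` property on an opened document.

```python
def describe_security(needs_open_password, is_encrypted):
    """Map two observations about a PDF to the control types described above."""
    if needs_open_password:
        return "document open password required"
    if is_encrypted:
        return "permissions password only (opens without a password)"
    return "no security controls"


def inspect_pdf(path):
    """Open path without a password and report which case applies."""
    import pikepdf  # third-party; imported lazily so describe_security has no dependencies

    try:
        with pikepdf.open(path) as pdf:
            # Assumption: is_encrypted exists in the installed pikepdf release.
            return describe_security(False, pdf.is_encrypted)
    except pikepdf.PasswordError:
        return describe_security(True, True)
```

If `inspect_pdf` reports the permissions-only case for your files, that would match the behavior you saw when printing them to a new PDF.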

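Concerning breaking PDF encryption

Neither PyPDF2 nor PyPDF4 is designed to break the document open password function of a PDF document. Both modules will throw the following error if they attempt to open an encrypted, password-protected PDF file.

PyPDF2.utils.PdfReadError: file has not been decrypted

The opening password function of an encrypted PDF file can be bypassed using a variety of methods, but a single technique might not work, and some will not be acceptable, because of several factors, including password complexity.

PDF encryption internally works with encryption keys of 40, 128, or 256 bits, depending on the PDF version. The binary encryption key is derived from a password provided by the user. The password is subject to length and encoding constraints.

For example, PDF 1.7 Adobe Extension Level 3 (Acrobat 9 - AES-256) introduced Unicode characters (65,536 possible characters) and raised the maximum length to 127 bytes in the UTF-8 representation of the password.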
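To give a feel for these numbers, here is a small arithmetic sketch. It only checks the 127-byte limit on the UTF-8 form of a password and prints the size of each key space; the function names are hypothetical, and it does not implement the actual key derivation.

```python
MAX_PASSWORD_BYTES = 127  # AES-256 limit quoted above


def fits_password_limit(password):
    """Return True if the UTF-8 form of password is within the 127-byte limit."""
    return len(password.encode("utf-8")) <= MAX_PASSWORD_BYTES


def key_space(bits):
    """Number of distinct keys for a key of the given length in bits."""
    return 2 ** bits


for bits in (40, 128, 256):
    print(f"{bits}-bit key: about {key_space(bits):.2e} possible keys")
```

Note that the limit is in bytes, not characters: a password of 64 two-byte characters such as "é" already exceeds it.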

The code below will open a PDF with restrictive editing enabled. It will save this file to a new PDF without the SECURED warning being added. The tika code will parse the contents from the new file.

from tika import parser
import pikepdf

# opens a PDF with restrictive editing enabled, but that still 
# allows printing.
with pikepdf.open("restrictive_editing_enabled.pdf") as pdf:
  pdf.save("restrictive_editing_removed.pdf")

# plain text output
parsedPDF = parser.from_file("restrictive_editing_removed.pdf")

# XHTML output
# parsedPDF = parser.from_file("restrictive_editing_removed.pdf", xmlContent=True)

content = parsedPDF["content"]
content = content.replace('\n\n', '\n')
print(content)

This code checks whether a password is required to open the file. It can be refined, and other functions can be added. There are several other features that could be added, but the documentation for pikepdf does not match the comments within the code base, so more research is required to improve this.

# this would be removed once logging is used
############################################
import sys
sys.tracebacklimit = 0
############################################

import pikepdf
from tika import parser

def create_pdf_copy(pdf_file_name):
  with pikepdf.open(pdf_file_name) as pdf:
    new_filename = f'copy_{pdf_file_name}'
    pdf.save(new_filename)
    return  new_filename

def extract_pdf_content(pdf_file_name):
  # plain text output
  # parsedPDF = parser.from_file("restrictive_editing_removed.pdf")

  # XHTML output
  parsedPDF = parser.from_file(pdf_file_name, xmlContent=True)

  pdf = parsedPDF["content"]
  pdf = pdf.replace('\n\n', '\n')
  return pdf

def password_required(pdf_file_name):
  # Returns None when the file opens without a password.
  try:
    with pikepdf.open(pdf_file_name):
      return None

  except pikepdf.PasswordError:
    return 'password required'

  except pikepdf.PdfError:
    return 'cannot open file'


filename = 'decrypted.pdf'
password = password_required(filename)
if password is not None:
  print(password)
else:
  pdf_file = create_pdf_copy(filename)
  results = extract_pdf_content(pdf_file)
  print(results)

Answer 2

You can try to handle the error these files produce when you open them without a password.

import pikepdf

def open_pdf(pdf_file_path, pdf_password=''):
    try:
        # Files without a document open password open directly.
        return pikepdf.Pdf.open(pdf_file_path)

    except pikepdf.PasswordError:
        # Retry with the supplied password; a wrong password raises PasswordError again.
        return pikepdf.Pdf.open(pdf_file_path, password=pdf_password)

You can use the returned pdf_obj for your parsing work. Also, you can provide the password in case you have an encrypted PDF.

Answer 3

For tabula-py, you can try the password option of read_pdf. It depends on tabula-java's functionality, so I'm not sure which encryption types are supported.
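A minimal sketch of that suggestion follows. `tabula_options` and `read_tables` are hypothetical wrapper names; `pages`, `stream`, and `password` are keyword arguments of `tabula.read_pdf`, but whether a given encrypted file can actually be read still depends on tabula-java.

```python
def tabula_options(password=None):
    """Build keyword arguments for tabula.read_pdf, adding a password if given."""
    options = {"pages": "all", "stream": True}
    if password:
        options["password"] = password
    return options


def read_tables(path, password=None):
    """Read all tables from path, passing the password through to tabula."""
    import tabula  # tabula-py; third-party and needs a Java runtime

    return tabula.read_pdf(path, **tabula_options(password))
```

For example, `read_tables("encrypted.pdf", password="password")` would try to read the tables directly from the encrypted file, without a separate decryption step.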
