Is it possible to extract text from a PDF for which "Page Extraction" is not allowed?

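I am able to extract text from PDFs that have no security restrictions. I just want to know whether it is possible to extract text from a PDF that has restrictions.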

Update:

Thanks to all for your comments. I appreciate your concern. Please understand the question. I did not ask how to do it. I just want to know if it is possible. I have created a PDF with these restrictions. I do not want my information to be extracted from my document. There are many developers who can achieve any task. I want to know if this task can be done. If this can be done, then I will investigate further to overcome this issue.

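The OP has clarified that he asks this question to find out whether his documents with such restrictions are safe from text extraction, and that he is not asking how to do it (in spite of the explicit wording and the library given in the tags). So here is an answer on the options in principle, not on a specific implementation. Thus...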

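Yes, it is possible to extract text from documents with restrictions, as long as the documents can be opened for reading at all and no other methods have been applied to prevent text extraction.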

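The restrictions you show are merely flags that tell a PDF processor which operations the author wishes to allow or disallow for users of the document. They are not technical restrictions.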

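These restrictions can only be applied to encrypted documents. But surely you want them to apply in particular to anyone (other than yourself) who can open the document for reading, whether by knowing a specific user password or by using the empty password.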

Cf. the specification ISO 32000 (quoted here from part 2; part 1 is similar, with a focus on PDF viewers):

If a user attempts to open an encrypted document that has a user password, the PDF reader shall first try to authenticate the encrypted document using the padding string defined in 7.6.4.3, "File encryption key algorithm" (default user password):

  • If this authentication attempt is successful, the PDF reader may open, decrypt, render and otherwise provide access to the document.

  • If this authentication attempt fails, the interactive PDF processor should prompt for a password. Correctly supplying either password (owner or user password) should enable the user to gain access to the document.

Whether additional operations shall be allowed on a decrypted document depends on which password (if any) was supplied when the document was opened and on any access restrictions that were specified when the document was created:

  • Opening the document with the correct owner password should allow full (owner) access to the document. This unlimited access includes the ability to change the document’s passwords and access permissions.

  • Opening the document with the correct user password (or opening a document with the default password) should allow additional operations to be performed according to the user access permissions specified in the document’s encryption dictionary.

Access permissions shall be specified in the form of flags corresponding to the various operations and the set of operations to which they correspond shall depend on the security handler’s revision number (also stored in the encryption dictionary).

...

Once the document has been opened and decrypted successfully, a PDF reader technically has access to the entire contents of the document. There is nothing inherent in PDF encryption that enforces the document permissions specified in the encryption dictionary. PDF readers shall respect the intent of the document creator by restricting user access to an encrypted PDF file according to the permissions contained in the file.

(ISO 32000-2, section 7.6.4, "Standard security handler")

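Thus, these restrictions only hold for cooperating PDF processors. In the case of open-source PDF libraries in particular, it is trivial for a programmer to remove any code that tries to enforce them.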

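Aware of this, the developers of open-source PDF libraries usually either do not attempt to enforce the restrictions at all, or they add a flag that overrides the enforcement, so that patched copies of the library do not start circulating.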
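
To illustrate that the permissions are nothing more than bits in an integer, here is a minimal Python sketch that decodes the value of the `P` entry of an encryption dictionary. The bit positions follow the user access permissions table of ISO 32000 for the standard security handler. The function name `decode_permissions`, the short permission labels, and the sample value `-3904` are assumptions made for this illustration and are not taken from any particular library or from the OP's document.

```python
# Bit positions are 1-based, counted from the low-order bit, as listed in the
# ISO 32000 table of user access permissions (standard security handler).
# The short labels are paraphrases chosen for this sketch.
PERMISSION_BITS = {
    3: "print",
    4: "modify contents",
    5: "copy or extract text and graphics",
    6: "add or modify annotations, fill form fields",
    9: "fill existing form fields",
    10: "extract for accessibility",
    11: "assemble document",
    12: "print in high quality",
}


def decode_permissions(p: int) -> dict[str, bool]:
    """Map each permission label to whether its bit is set in the P value."""
    # P is stored as a signed 32-bit integer; mask it so negative values
    # can be tested bit by bit.
    p &= 0xFFFFFFFF
    return {
        name: bool(p & (1 << (bit - 1)))
        for bit, name in PERMISSION_BITS.items()
    }


if __name__ == "__main__":
    # -3904 is a hypothetical sample value in which all listed bits are clear.
    for name, allowed in decode_permissions(-3904).items():
        print(f"{name}: {'allowed' if allowed else 'not allowed'}")
```

Note that nothing in this decoding enforces anything. Once the document has been decrypted, a processor only learns what the author wished for, and it is up to the processor whether to act on it.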