Fastest way to check whether a PDF is corrupted (or just missing EOF) in Ruby?

Question

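I am looking for a way to check whether a PDF is missing its end-of-file marker. So far I have found that I can use the pdf-reader gem and rescue the MalformedPDFError exception, or of course I could simply read the whole file and check whether it ends with the EOF marker. I need to process many potentially large PDFs, and I want to load as little into memory as possible.

Note: all the files I want to detect are missing the EOF marker, so I feel this is a more specific scenario than detecting general PDF "corruption". What is the best and fastest way to do this?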

Answer 1

TL;DR

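Looking for %%EOF, with or without the related structures, is relatively fast even if you scan the whole of a reasonably sized PDF file. However, you can speed things up by restricting the search to the last kilobyte, or to the last 6 or 7 bytes if you only want to verify that %%EOF\n is the only thing on the last line of the PDF file.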
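
The narrowest check mentioned above, looking only at the last few bytes, might be sketched as follows. The method name `ends_with_eof_marker?` is hypothetical, and accepting LF, CR, or CRLF after the marker (or no line ending at all) is an assumption of this sketch rather than something the original answer specifies.

```ruby
# Hypothetical helper: true when the file ends with "%%EOF", optionally
# followed by a single end-of-line sequence (LF, CR, or CRLF).
def ends_with_eof_marker?(filename)
  File.open(filename, 'rb') do |f|
    # "%%EOF" is 5 bytes; allow up to 2 more for the end-of-line sequence.
    # Files shorter than 7 bytes are read whole.
    f.seek(-[f.size, 7].min, IO::SEEK_END)
    f.read.match?(/%%EOF(?:\r\n|\r|\n)?\z/)
  end
end
```

Because at most 7 bytes are read per file, memory use stays constant no matter how large the PDF is. This does not confirm that %%EOF sits on a line of its own, only that the file ends with it.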

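Note that only a full parse of the PDF file can tell you whether the file is corrupted, and only a full parse of the file trailer can fully validate that the trailer conforms to the standard. However, I provide two approximations below that are reasonably accurate and relatively fast in the general case.

Check the Last Kilobyte for the File Trailer

This option is fairly fast because it looks only at the tail of the file and uses a string comparison rather than a regular-expression match. According to Adobe: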

Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.

Therefore, the following works by looking for the end-of-file marker within that range:

def valid_file_trailer?(filename)
  File.open(filename, 'rb') do |f|
    f.seek(-[f.size, 1024].min, IO::SEEK_END) # files under 1 KB are read whole
    f.read.include?('%%EOF')
  end
end

A Stricter Check of the File Trailer via Regex

However, the ISO standard is both more complex and a lot stricter. It says, in part:

The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets (<< … >>) (using LESS-THAN SIGNs (3Ch) and GREATER-THAN SIGNs (3Eh)).

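Without actually parsing the PDF, you won't be able to validate this with perfect accuracy using regular expressions, but you can get close. For example: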

def valid_file_trailer? filename
  pattern = /^startxref\n\d+\n%%EOF\n\z/m
  File.open(filename) { |f| !!(f.read.scrub =~ pattern) }
end
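
The method above reads the entire file before matching, which may be costly for very large PDFs. A possible tail-only variant is sketched below. The name `strict_file_trailer?` is hypothetical, and the sketch assumes that startxref, its byte offset, and %%EOF all fall within the last 1024 bytes. It also tolerates CR and CRLF line endings and a missing final newline, which is an assumption that goes beyond the pattern shown above.

```ruby
# Hypothetical tail-only variant of the stricter trailer check.
# Assumption: startxref, its offset, and %%EOF fit in the last 1024 bytes.
def strict_file_trailer?(filename)
  # startxref at the start of a line, then the offset on the next line,
  # then %%EOF as the final line (the trailing end-of-line is optional).
  pattern = /(?:\A|[\r\n])startxref(?:\r\n|\r|\n)\d+(?:\r\n|\r|\n)%%EOF(?:\r\n|\r|\n)?\z/
  File.open(filename, 'rb') do |f|
    # Read at most the last kilobyte; smaller files are read whole.
    f.seek(-[f.size, 1024].min, IO::SEEK_END)
    f.read.match?(pattern)
  end
end
```

Opening the file in binary mode avoids the need for scrub, since an ASCII-only pattern can be matched against a binary string directly. Like the original, this checks only the shape of the last three lines and does not validate the trailer dictionary or the offset itself.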
