How do I extract (hexadecimal-encoded) text from the content of a PDF?

Question

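I have two versions of a PDF that I know are slightly different: the "Reassessment" text in the gray bar on page 3.

I'm trying to get a textual diff of them on my machine.

I used pdfcpu to extract the content of the multi-page PDF, then ran page 3 through the diff utility: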

% diff out_orig/page_3.txt out_new/page_3.txt 

1650a1651,1658
> BT
> 1 0 0 rg
> 0 i 
> /RelativeColorimetric ri
> /C2_2 9.96 Tf
> 0 Tw 358.147 648.779 Td
> <0035004800440056005600480056005600500048005100570003003000580056005700030032004600460058005500030028005900480055005C0003001600030030005200510057004B0056>Tj
> ET

I looked up 7.3.4.3 Hexadecimal String in the PDF reference:

A hexadecimal string shall be written as a sequence of hexadecimal digits encoded as ASCII characters and enclosed within angle brackets.

So I figured I should be able to do something simple, like interpreting the hex characters directly as ASCII text:

>>> s = '0035004800440056005600480056005600500048005100570003003000580056005700030032004600460058005500030028005900480055005C0003001600030030005200510057004B0056'
>>> import binascii
>>> binascii.a2b_hex(s)
b'\x005\x00H\x00D\x00V\x00V\x00H\x00V\x00V\x00P\x00H\x00Q\x00W\x00\x03\x000\x00X\x00V\x00W\x00\x03\x002\x00F\x00F\x00X\x00U\x00\x03\x00(\x00Y\x00H\x00U\x00\\\x00\x03\x00\x16\x00\x03\x000\x00R\x00Q\x00W\x00K\x00V'

But I get garbage, even with the null bytes removed:

>>> binascii.a2b_hex(s).replace(b'\x00', b'')
b'5HDVVHVVPHQW\x030XVW\x032FFXU\x03(YHU\\\x03\x16\x030RQWKV'

I expected the hex string to look more like this (going the other way, from text to hex):

>>> binascii.b2a_hex(b'Reassessment Must Occur Every 3 Months')
b'52656173736573736d656e74204d757374204f636375722045766572792033204d6f6e746873'

I found this comment on a somewhat-related SO post:

Literal string (7.3.4.2) - this is pretty much straight-forward, as you just walk the data for "(.*?)" - That's only true for simple examples using standard font encoding. Meanwhile custom encodings for embedded fonts have become very common.

So... maybe that hex string isn't just hex-encoded ASCII?

What am I missing in trying to extract a textual diff?

Answer 1

Here you go:

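>>> s = '0035004800440056005600480056005600500048005100570003003000580056005700030032004600460058005500030028005900480055005C0003001600030030005200510057004B0056'
>>> def chunks(lst, n):  # split lst into consecutive n-character pieces
...     for i in range(0, len(lst), n):
...         yield lst[i:i + n]
...
>>> ns = [29 + int(c, 16) for c in chunks(s, 4)]  # each 4-digit code + 29 gives the ASCII value
>>> print(bytes(ns))
b'Reassessment Must Occur Every 3 Months'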

chunks is copied from here.
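To spell out why this works: each group of four hex digits is a 2-byte character code in the selected font (/C2_2), not a byte of ASCII. For this particular string, every code happens to sit exactly 29 below the ASCII value of the character it stands for. The sketch below checks that against the expected text and then uses the same offset to annotate `Tj` operands so the diff output becomes readable. The function names (`codes_from_hex`, `decode_with_offset`, `annotate_tj`) are made up for illustration. The constant 29 is an observation about this one font, not a general rule, and the sketch assumes every hex string holds whole 2-byte codes. A more general approach would look the codes up in the font's /ToUnicode CMap, if it has one.

```python
import re

# Matches a hex-string operand followed by the Tj (show text) operator.
HEX_TJ = re.compile(r"<([0-9A-Fa-f\s]+)>\s*Tj")


def codes_from_hex(hex_string, digits=4):
    """Split a PDF hex string into integer character codes (2 bytes each by default)."""
    cleaned = "".join(hex_string.split())  # white space inside <...> is ignored
    return [int(cleaned[i:i + digits], 16) for i in range(0, len(cleaned), digits)]


def decode_with_offset(hex_string, offset=29):
    """Decode by adding a constant offset to each code (assumption: fits this font only)."""
    return "".join(chr(code + offset) for code in codes_from_hex(hex_string))


def annotate_tj(content, offset=29):
    """Append the decoded text as a PDF-style comment after each <hex>Tj operand."""
    return HEX_TJ.sub(
        lambda m: m.group(0) + "  % " + decode_with_offset(m.group(1), offset),
        content,
    )


s = ("0035004800440056005600480056005600500048005100570003003000580056"
     "005700030032004600460058005500030028005900480055005C000300160003"
     "0030005200510057004B0056")
expected = "Reassessment Must Occur Every 3 Months"

# Where the 29 comes from: the difference is the same for every character.
offsets = {ord(ch) - code for ch, code in zip(expected, codes_from_hex(s))}
print(offsets)

print(decode_with_offset(s))
print(annotate_tj("<00350048>Tj"))
```

Running `annotate_tj` over each extracted page file before calling diff should make changed text show up in readable form, as long as every font on the page uses the same offset. Where that does not hold, the decoded text will be wrong for those fonts.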

python

pdf