"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 2491: invalid start byte"

Question

I get this error from the following code:

subprocess.getoutput('./pdftotext file.pdf -')

I also tried UTF-16:

subprocess.check_output('./pdftotext file.pdf -', shell=True, encoding='utf-16')

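The PDF is from https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf, but I see the same error (with a different byte value) with another PDF.

The same command on this Mozilla PDF works fine when run directly in Bash, without Python.

I also tried the parameter universal_newlines=True, for example: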

return subprocess.run(
      './pdftotext file.pdf -',
      shell=True,
      stdout=subprocess.PIPE,
      universal_newlines=True
).stdout

Environment: Python 3.6 on Lambda.

Answer 1

"Filter" the output through iconv, ignoring errors:

subprocess.getoutput('./pdftotext file.pdf - | iconv --to-code utf-8//IGNORE')

Feel free to add your own answer. I am still curious about alternative solutions and the root cause of the problem.
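On the root cause, the error message itself gives a hint: 0xad is a UTF-8 continuation byte, so it can never start a character, which means the output is not valid UTF-8 at that position. In Latin-1 the same byte is the soft hyphen (U+00AD), so one plausible explanation, not confirmed anywhere in this thread, is that this pdftotext build emits Latin-1 rather than UTF-8. Here is a minimal sketch of what happens, using made-up sample bytes and hypothetical helper names:

```python
# Hypothetical sample: the byte 0xad embedded in otherwise plain ASCII text.
raw = b"trace\xadmonkey"


def strict_error(data: bytes):
    """Return the reason a strict UTF-8 decode fails, or None if it succeeds."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        return exc.reason


def decode_utf8_ignoring(data: bytes) -> str:
    """Drop undecodable bytes, roughly what `iconv --to-code utf-8//IGNORE` does."""
    return data.decode("utf-8", errors="ignore")


print(strict_error(raw))             # why the strict decode fails
print(decode_utf8_ignoring(raw))     # the offending byte is silently dropped
print(repr(raw.decode("latin-1")))   # assumption: the data is really Latin-1
```

Note that ignoring errors throws data away. If the output really is Latin-1, decoding it as Latin-1 keeps every character.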

Answer 2

Try the following code:

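return subprocess.run(
    './pdftotext file.pdf -',
    shell=True,
    stdout=subprocess.PIPE,
    universal_newlines=True,  # text mode; already implied once encoding is set
    encoding='utf-8',         # replace with the encoding the output actually uses
    errors='ignore',          # 'ignore' drops bad bytes; 'replace' inserts U+FFFD
).stdout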

More about this: https://docs.python.org/3/library/stdtypes.html#bytes.decode https://docs.python.org/3/library/codecs.html#error-handlers
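An alternative that keeps the decoding step visible is to capture raw bytes and decode them yourself. The sketch below is only an illustration: `run_and_decode` is a hypothetical helper, and a small Python child process stands in for pdftotext so the example runs without the binary or a PDF.

```python
import subprocess
import sys


def run_and_decode(cmd, encoding="utf-8", errors="replace"):
    """Run cmd (an argument list), capture stdout as bytes, decode leniently."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode(encoding, errors=errors)


# Stand-in for pdftotext (assumption): a child process that writes
# the bytes b'ab\xadcd' to its stdout.
fake_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(b'ab\\xadcd')",
]

text = run_and_decode(fake_cmd)
print(repr(text))  # the invalid byte shows up as U+FFFD instead of raising
```

With the real tool the call would presumably be `run_and_decode(['./pdftotext', 'file.pdf', '-'])`, which also avoids shell=True. If your pdftotext build supports the `-enc` option, asking it for UTF-8 output (`-enc UTF-8`) may avoid the problem at the source, but I have not verified that against the binary used in the question.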

python

unicode

subprocess

pdftotext

python-3.x