Problem reading PDF to XML into memory using PDFMiner.Six

Question

Consider the following snippet:

import io

from pdfminer.high_level import extract_text_to_fp

result = io.StringIO()
with open("file.pdf") as fp:
    extract_text_to_fp(fp, result, output_type='xml')

data = result.getvalue()

This results in the following error:

ValueError: Codec is required for a binary I/O output

If I omit output_type, I get this error instead:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 3804: character maps to <undefined>

I don't understand why this happens and would appreciate help with a workaround.

Answer 1

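I figured out how to solve it. First, you need to open "file.pdf" in binary mode. Then, if you want to read the output into memory, use BytesIO instead of StringIO and decode the result. For example: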

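import io

from pdfminer.high_level import extract_text_to_fp

# Collect the XML output as bytes in memory
result = io.BytesIO()

# A PDF is a binary file, so open it with 'rb'
with open("file.pdf", 'rb') as fp:
    extract_text_to_fp(fp, result, output_type='xml')

# Decode the collected bytes into a str
data = result.getvalue().decode("utf-8")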
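
To see why both changes matter without needing a real PDF, here is a minimal sketch using only the standard library. It assumes the default text encoding on the asker's machine was cp1252 (the 'charmap' in the traceback suggests a Windows code page, but that is a guess), and the byte string and XML snippet are made-up stand-ins, not actual PDFMiner output.

```python
import io

# Stand-in for raw PDF content (hypothetical bytes). 0x8f has no mapping
# in cp1252, which is assumed here to be the default text-mode encoding.
raw = b"%PDF-1.4\n\x8f"

# Decoding the bytes as text fails, much like reading the file
# after open("file.pdf") without 'rb'.
try:
    raw.decode("cp1252")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# A BytesIO buffer accepts already-encoded bytes; decode once at the end.
buffer = io.BytesIO()
buffer.write("<pages>é</pages>".encode("utf-8"))  # hypothetical XML output
data = buffer.getvalue().decode("utf-8")
```

Opening the PDF with 'rb' avoids the first failure because no decoding is attempted on read, and decoding the BytesIO contents at the end gives you an ordinary str to work with.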

python

python-3.x

pdfminer