Reading a pdf from a zipfile
I'm trying to get PyPDF2 to read a small .pdf file that is inside a simple zip file. Here is what I have so far:
import PyPDF2, zipfile
with zipfile.ZipFile("TEST.zip") as z:
    filename = z.namelist()[0]
    a = z.filelist[0]
    b = z.open(filename)
    c = z.read(filename)
    PyPDF2.PdfFileReader(b)
Error message:
PdfReadWarning: PdfFileReader stream/file object is not in binary mode. It may not be read correctly. [pdf.py:1079]
io.UnsupportedOperation: seek
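The file hasn't been extracted yet, so you can't operate on it with open().
That's fine, though, because PdfFileReader wants a seekable stream, and we can provide one using BytesIO. The example below takes the decompressed bytes and passes them to BytesIO, which turns them into a stream for PdfFileReader. If you leave out BytesIO, you get: AttributeError: 'bytes' object has no attribute 'seek'.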
import PyPDF2, zipfile
from io import BytesIO
with zipfile.ZipFile('sample.zip', 'r') as z:
    filename = z.namelist()[0]
    pdf_file = PyPDF2.PdfFileReader(BytesIO(z.read(filename)))
Result:
In [20]: pdf_file
Out[20]: <PyPDF2.pdf.PdfFileReader at 0x7f01b61db2b0>
In [21]: pdf_file.getPage(0)
Out[21]:
{'/Type': '/Page',
'/Parent': {'/Type': '/Pages',
'/Count': 2,
'/Kids': [IndirectObject(4, 0), IndirectObject(6, 0)]},
'/Resources': {'/Font': {'/F1': {'/Type': '/Font',
'/Subtype': '/Type1',
'/Name': '/F1',
'/BaseFont': '/Helvetica',
'/Encoding': '/WinAnsiEncoding'}},
'/ProcSet': ['/PDF', '/Text']},
'/MediaBox': [0, 0, 612, 792],
'/Contents': {}}
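If the archive may contain more than one file, the same idea can be wrapped in two small helpers. This is only a sketch using the standard library: the helper names open_member_as_stream and pdf_member_names are made up for illustration, and the archive is built in memory with placeholder bytes (not a valid PDF) so the snippet runs without any files on disk.

```python
import io
import zipfile


def open_member_as_stream(archive, member_name):
    """Return a seekable in-memory stream holding one decompressed zip member."""
    with zipfile.ZipFile(archive) as z:
        # z.read() returns the member's decompressed contents as bytes.
        data = z.read(member_name)
    # BytesIO wraps the bytes in a file-like object that supports seek().
    return io.BytesIO(data)


def pdf_member_names(archive):
    """List the archive members whose names end in .pdf (case-insensitive)."""
    with zipfile.ZipFile(archive) as z:
        return [n for n in z.namelist() if n.lower().endswith(".pdf")]


# Build a tiny archive in memory (hypothetical contents, for illustration only).
archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w") as z:
    z.writestr("notes.txt", b"not a pdf")
    z.writestr("doc.pdf", b"%PDF-1.4 placeholder")

names = pdf_member_names(archive_bytes)
stream = open_member_as_stream(archive_bytes, names[0])
print(names, stream.seekable())

# With a real PDF in the archive, the stream could then be handed to the reader:
# pdf_file = PyPDF2.PdfFileReader(stream)
```

Filtering on the .pdf suffix avoids relying on the PDF being the first entry returned by namelist(), which is what the [0] index in the answer assumes.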