Extract an image from a PDF in Python

Question

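I am trying to extract images from a PDF using PyPDF2, but the image my code produces looks very different from how it actually appears in the document. See the example below:

But it should look like this:

This is the PDF I am using: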

https://www.hbp.com/resources/SAMPLE%20PDF.pdf

This is my code:

import PyPDF2  # this snippet uses the legacy PyPDF2 (< 3.0) API

pdf_filename = "SAMPLE.pdf"
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
page = cond_scan_reader.getPage(0)

xObject = page['/Resources']['/XObject'].getObject()
i = 0
for obj in xObject:
    # print(xObject[obj])
    if xObject[obj]['/Subtype'] == '/Image':
        if xObject[obj]['/Filter'] == '/DCTDecode':
            data = xObject[obj]._data
            img = open("{}".format(i) + ".jpg", "wb")
            img.write(data)
            img.close()
            i += 1

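I also need to keep the image in its original color mode: if it is CMYK, I cannot simply convert it to RGB, because I need that information. In addition, I am trying to get the DPI of the images I extract from the PDF. Is that information always stored in the image? Thanks in advance.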

Answer 1

Hope this works: you may need to use another library, such as Pillow.

Here is an example:


    from PIL import Image
    image = Image.open("path_to_image")
    if image.mode == 'CMYK':
        image = image.convert('RGB')
    image.save("path_to_image.jpg")

Reference: Convert from CMYK to RGB
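Before converting anything, it may help to check what the extracted JPEG actually contains. The following is a minimal sketch, not a full JPEG parser, that reads only the header with the standard library and reports the number of color components (4 usually means CMYK or YCCK), the JFIF density if present, and the Adobe APP14 transform flag. The function name `inspect_jpeg` and the dictionary keys are illustrative assumptions, not part of any library. Note that the JFIF density fields are optional, so a JPEG pulled out of a PDF may carry no DPI at all; in a PDF the effective resolution also depends on how large the image is drawn on the page.

```python
import struct

# SOF markers carry the frame header; 0xC4, 0xC8 and 0xCC are not SOF markers.
_SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def inspect_jpeg(data: bytes) -> dict:
    """Read header fields from a JPEG stream without decoding any pixels.

    Hypothetical helper for illustration only.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    info = {"components": None, "size": None, "dpi": None, "adobe_transform": None}
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte in front of a marker
            pos += 1
            continue
        if marker in (0xDA, 0xD9):  # start of scan / end of image: header is over
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE0 and payload[:5] == b"JFIF\x00" and len(payload) >= 12:
            units, x_density, y_density = struct.unpack(">BHH", payload[7:12])
            if units == 1:  # dots per inch
                info["dpi"] = (x_density, y_density)
            elif units == 2:  # dots per centimetre
                info["dpi"] = (round(x_density * 2.54), round(y_density * 2.54))
            # units == 0 only describes the pixel aspect ratio, so dpi stays None
        elif marker == 0xEE and payload[:5] == b"Adobe" and len(payload) >= 12:
            # 0 = unknown (RGB or CMYK), 1 = YCbCr, 2 = YCCK
            info["adobe_transform"] = payload[11]
        elif marker in _SOF_MARKERS and len(payload) >= 6:
            _precision, height, width, components = struct.unpack(">BHHB", payload[:6])
            info["size"] = (width, height)
            info["components"] = components
        pos += 2 + length
    return info
```

With the code from the question, this could be called on the raw stream bytes (`data`) before they are written to disk, so the color information is inspected without any conversion.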

Answer 2

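I used pdfreader to extract the image from your sample. The image uses an ICCBased color space with N=4, and its Intent value is RelativeColorimetric. This means the "closest" PDF color space is DeviceCMYK.

All you need to do is convert the image to RGB and invert the colors.

Here is the code: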

from pdfreader import SimplePDFViewer
import PIL.ImageOps 

fd = open("SAMPLE PDF.pdf", "rb")
viewer = SimplePDFViewer(fd)

viewer.render()
img = viewer.canvas.images['Im0']

# this displays ICCBased 4 RelativeColorimetric
print(img.ColorSpace[0], img.ColorSpace[1].N, img.Intent)

pil_image = img.to_Pillow()
pil_image = pil_image.convert("RGB")
inverted = PIL.ImageOps.invert(pil_image)


inverted.save("sample.png")

Read more about PDF objects: Image (sec. 8.9.5), InlineImage (sec. 8.9.7)
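Since the question asks to keep the image in CMYK, the inversion step could in principle be applied to the CMYK samples directly instead of going through RGB. The following is only a sketch under stated assumptions: the samples are already decoded, 8 bits per channel, interleaved as C, M, Y, K (for example the bytes returned by Pillow's `Image.tobytes()` on a CMYK image), and the image really is stored inverted, which should be verified against the rendered page. The helper name `invert_cmyk_samples` is hypothetical.

```python
# Lookup table that maps every 8-bit value v to 255 - v.
_INVERT_TABLE = bytes(255 - v for v in range(256))


def invert_cmyk_samples(samples: bytes) -> bytes:
    """Invert interleaved 8-bit CMYK samples without leaving CMYK.

    Hypothetical helper for illustration only.
    """
    if len(samples) % 4 != 0:
        raise ValueError("expected 4 bytes (C, M, Y, K) per pixel")
    return samples.translate(_INVERT_TABLE)
```

The result has the same length and channel layout as the input, so it can be loaded back into a CMYK image (for instance with Pillow's `Image.frombytes("CMYK", size, data)`) and saved without an RGB round trip.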
