Why does extracting a JPG image from a PDF with Wand give a black background over the text?

Question

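I have a problem with some PDF files. I need to convert them to JPG images so they can be used for OCR, but for some of them Wand produces a JPG with a black background over the text. I have read that this is a common colorspace problem. It seems to happen when a Word file is converted to PDF and the colorspace becomes CMYK, while Tesseract OCR only accepts the RGB colorspace. I have written a Python conversion script, but I would like to fix this issue. Can you help me? Thanks.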

Answer 1

Here is my code:

import os
from wand.image import Image as wi

def convert_pdf(pdf_file):

    # File name with and without the .pdf extension
    basename = os.path.basename(pdf_file)
    title = os.path.splitext(basename)[0]
    pdf = wi(filename=pdf_file, resolution=100)
    pdfImage = pdf.convert("jpg")
    # PATH_IMAGES is assumed to be defined elsewhere in the script
    outputPath = PATH_IMAGES + "/" + basename
    if not os.path.exists(outputPath):
        os.mkdir(outputPath)

    # Save every page of the PDF as its own JPG
    i = 1
    for img in pdfImage.sequence:
        page = wi(image=img)
        imagePathConverted = outputPath + "/" + title + "(*page=" + str(i) + "*)" + ".jpg"
        page.save(filename=imagePathConverted)
        '''image = Image.open(imagePathConverted)

        if image.mode != 'RGB':
            rgb_image = image.convert('RGB')
            rgb_image.save(imagePathConverted)'''
        i += 1

    return outputPath

Answer 2

The solution is to set these properties before calling save:

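from wand.color import Color

page = wi(image=img)

# Flatten transparent areas onto white before the page is written as a JPG
page.background_color = Color('white')
page.alpha_channel = 'remove'

page.save(...)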
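
A likely explanation for why this works: the rendered PDF page has a transparent background, and JPG has no alpha channel, so the transparent pixels end up showing their underlying colour, which is typically black. Removing the alpha channel with a white background composites every pixel over white first. The sketch below only illustrates that arithmetic in plain Python; `flatten_pixel` and the sample pixel values are hypothetical and are not part of Wand.

```python
def flatten_pixel(rgba, background):
    """Composite one RGBA pixel (channels 0-255) over an opaque RGB background."""
    r, g, b, a = rgba
    alpha = a / 255
    return tuple(
        round(alpha * fg + (1 - alpha) * bg)
        for fg, bg in zip((r, g, b), background)
    )


WHITE = (255, 255, 255)

# Assumed pixel values: fully transparent "paper" and fully opaque black text.
transparent_paper = (0, 0, 0, 0)
opaque_text = (0, 0, 0, 255)

# Simply dropping the alpha channel leaves the paper black, like the text.
paper_without_alpha = transparent_paper[:3]

# Compositing over white keeps the text black and turns the paper white.
paper_on_white = flatten_pixel(transparent_paper, WHITE)
text_on_white = flatten_pixel(opaque_text, WHITE)
```

If this reading is right, the black background is a transparency issue rather than a CMYK one, which would also explain why converting the saved JPG to RGB afterwards does not help.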

Thanks to this Stack Overflow answer.
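
For reference, here is a minimal sketch of the conversion function from the question with the two settings applied. It is one possible arrangement, not the only one: the `output_root` parameter (standing in for `PATH_IMAGES`), the `page_filename` helper, and naming the output directory after the file title are assumptions made for the example.

```python
import os


def page_filename(title, page_number):
    """Per-page file name, following the naming scheme used in the question."""
    return f"{title}(*page={page_number}*).jpg"


def convert_pdf(pdf_file, output_root, resolution=100):
    """Render each PDF page to a JPG with a white, opaque background.

    Returns the directory that holds the page images.
    """
    # Imported here so page_filename() stays usable without ImageMagick installed.
    from wand.color import Color
    from wand.image import Image

    title = os.path.splitext(os.path.basename(pdf_file))[0]
    output_dir = os.path.join(output_root, title)
    os.makedirs(output_dir, exist_ok=True)

    with Image(filename=pdf_file, resolution=resolution) as pdf:
        for number, frame in enumerate(pdf.sequence, start=1):
            with Image(image=frame) as page:
                # Composite transparent areas over white, then drop alpha,
                # so the JPG does not get a black background.
                page.background_color = Color("white")
                page.alpha_channel = "remove"
                page.format = "jpeg"
                page.save(
                    filename=os.path.join(output_dir, page_filename(title, number))
                )

    return output_dir
```

Calling `convert_pdf("report.pdf", "images")` would then write `images/report/report(*page=1*).jpg` and so on, one file per page. Reading PDFs through Wand also requires Ghostscript to be installed alongside ImageMagick.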

python

ocr

rgb

cmyk

wand