Extract pdfs from a directory and output images to a different directory with pdf2image

Question

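I am trying to read some PDFs located in one directory and output images of their pages to a different directory.

(I am looking to understand how this code works, and I was hoping there is a cleaner way to specify an output directory for my image files.)

What I did works, but I think it just jumps back and forth between my save directory and my PDF directory.

That does not feel like a clean approach. Is there a better option that keeps the existing code and does what my added lines do?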

import os
from pdf2image import convert_from_path

pdf_dir = r"mydirectorypathwithPDFs"
save_dir = 'mydirectorypathforimages'

os.chdir(pdf_dir)

for pdf_file in os.listdir(pdf_dir):
    os.chdir(pdf_dir) #I added this, change back to the pdf directory
    if pdf_file.endswith(".pdf"):
        pages = convert_from_path(pdf_file, 300)
        pdf_file = pdf_file[:-4]
        for page in pages:
            os.chdir(save_dir) #I added this, change to the save directory
            page.save("%s-page%d.jpg" % (pdf_file,pages.index(page)), "JPEG")

The code, which I modified slightly, was written by @photek1944 and can be found here:

Answer 1

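This may go beyond what you asked for, but any time someone wants to simplify os-based code that deals with paths and files, I like to recommend Python's pathlib module, because it is awesome. Here is how I would personally implement your program: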

from pathlib import Path
from pdf2image import convert_from_path

# Use forward slashes here, even if you're on Windows.
pdf_dir = Path('my/directory/path/with/PDFs')
save_dir = Path('my/directory/path/for/images')

for pdf_file in pdf_dir.glob('*.pdf'):
    pages = convert_from_path(pdf_file, 300)
    for num, page in enumerate(pages, start=1):
        page.save(save_dir / f'{pdf_file.stem}-page{num}.jpg', 'JPEG')

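pathlib automatically takes care of providing the right separator (\ on Windows and / mostly everywhere else), it lets you append to a path with / as an operator, and it makes searching a folder especially convenient with the glob method. It also exposes attributes such as name (blah.pdf), stem (blah), and suffix (.pdf) for easier access to the parts of a path and file name.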
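As a rough illustration of the attributes and the / operator described above, here is a small sketch. The paths my/pdfs/blah.pdf and my/images are made up for the example, and nothing needs to exist on disk for it to run. PurePosixPath is used only so the printed separator is the same on every platform; Path offers the same attributes but prints the native separator.

```python
from pathlib import PurePosixPath

# Hypothetical paths, purely for illustration; no files are touched.
pdf_file = PurePosixPath('my/pdfs/blah.pdf')
save_dir = PurePosixPath('my/images')

print(pdf_file.name)    # blah.pdf
print(pdf_file.stem)    # blah
print(pdf_file.suffix)  # .pdf

# The / operator joins path components into a new path object.
out_path = save_dir / f'{pdf_file.stem}-page1.jpg'
print(out_path)         # my/images/blah-page1.jpg
```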

I also used an f-string for more readable formatting, and enumerate to keep track of the page number. (I set it to start at 1; I believe your original code would have numbered the first page 0.)
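To show how the two numbering approaches differ, here is a minimal sketch. Plain strings stand in for the page images (an assumption made so the example runs without pdf2image). It only demonstrates how list.index and enumerate behave; whether two real page images compare equal depends on the image objects themselves.

```python
# Strings stand in for page images; two "pages" are deliberately identical.
pages = ['cover', 'blank', 'blank']

# index() rescans the list and returns the position of the first equal item,
# so the duplicate entry gets the same number twice.
by_index = [pages.index(page) for page in pages]
print(by_index)       # [0, 1, 1]

# enumerate() simply counts as it goes; start=1 numbers the first page 1.
by_enumerate = [num for num, page in enumerate(pages, start=1)]
print(by_enumerate)   # [1, 2, 3]
```

Besides avoiding repeated numbers, enumerate does not search the list on every iteration, which matters a little for long documents.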

