Best way to merge multiple PDFs downloaded from Azure Blob Storage into one using Python?

I'm trying to download multiple PDF files from Azure and merge them all into a single PDF (using the PyPDF2 library) so I can re-upload it to Azure.

I currently get a PyPDF2.utils.PdfReadError: Unsupported PNG filter 4 error on the line pdf = PyPDF2.PdfFileReader(output).

    consolidated_pdf = review_level_str.title() + '.pdf'
    merger = PyPDF2.PdfFileMerger()
    
    for each_file in filename_lst:
        blob_client = blob_service.get_blob_client(container=f'{flask_env}-downloads', blob=each_file)
        blob_object = blob_client.download_blob()

        bytes_file = blob_object.readall()
        output = io.BytesIO()
        output.write(bytes_file)
        pdf = PyPDF2.PdfFileReader(output)
        merger.append(pdf)

    blob_client_pdf = blob_service.get_blob_client(container=f'{flask_env}-downloads', blob=consolidated_pdf)
    blob_client_pdf.upload_blob(pdf.getvalue())

Try this:

    import os
    import shutil

    from azure.storage.blob import ContainerClient
    from PyPDF2 import PdfFileMerger

    pdf_list = ['test1.pdf', 'test2.pdf']
    container = 'pdf'
    storage_conn_str = ''

    # Scratch directory for the downloaded and merged files.
    temp_path = 'd:/home/temp2/'
    os.mkdir(temp_path)

    merger = PdfFileMerger()
    # Use a distinct variable name so the ContainerClient class is not shadowed.
    container_client = ContainerClient.from_connection_string(storage_conn_str, container)

    # Download each blob to disk, then queue the local file for merging.
    for pdf in pdf_list:
        local_pdf_path = os.path.join(temp_path, pdf)
        with open(local_pdf_path, 'wb') as download_file:
            download_file.write(container_client.download_blob(pdf).readall())
        merger.append(local_pdf_path)

    merged_pdf_path = os.path.join(temp_path, 'merged.pdf')
    merger.write(merged_pdf_path)
    merger.close()

    # Upload the merged file, replacing any existing blob with the same name.
    with open(merged_pdf_path, 'rb') as stream:
        container_client.upload_blob('merged.pdf', stream, overwrite=True)

    # Remove all temp files after upload.
    shutil.rmtree(temp_path)
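
If you would rather not write temp files to disk, the same merge can probably be done entirely in memory, which is closer to what the question attempted. The following is only a minimal sketch: `merge_pdfs_in_memory` is a hypothetical helper name, and it assumes the merger object exposes `append(fileobj)`, `write(fileobj)` and `close()` the way PyPDF2's `PdfFileMerger` does.

```python
import io
from typing import Iterable


def merge_pdfs_in_memory(pdf_blobs: Iterable[bytes], merger) -> bytes:
    """Merge PDF byte strings with a PyPDF2-style merger, without temp files.

    `merger` is assumed to provide append(fileobj), write(fileobj) and
    close(), as PyPDF2's PdfFileMerger does (hypothetical helper).
    """
    try:
        for data in pdf_blobs:
            # Wrapping the bytes directly gives a stream positioned at 0.
            merger.append(io.BytesIO(data))

        merged = io.BytesIO()
        merger.write(merged)
        # getvalue() returns the whole buffer regardless of stream position.
        return merged.getvalue()
    finally:
        merger.close()
```

With the names from the answer above, you would pass `PdfFileMerger()` as the merger and the result of `download_blob(name).readall()` for each blob as `pdf_blobs`, then hand the returned bytes to `upload_blob('merged.pdf', merged_bytes, overwrite=True)`. Note that the bytes to upload come from the output buffer, not from a reader object.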

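Result: after the script runs, merged.pdf appears in the container. Opening merged.pdf shows the pages of the source files combined into one document.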

Let me know if you have any further questions.