How to use the Python pdf2image module (and thus poppler) on Google Cloud Functions?

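I am trying to convert a PDF to JPEG on Google Cloud Functions using the Python module pdf2image. However, I cannot figure out how to resolve the following errors in the cloud function: "No such file or directory: 'pdfinfo'" and "Unable to get page count. Is poppler installed and in PATH?"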

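The error is very similar to the one in this question. pdf2image is a wrapper around poppler's "pdftoppm" and "pdftocairo". But how do I install the poppler package on Google Cloud Functions and add it to PATH? I could not find any relevant references. Is this even possible? If not, what can be done instead?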

There is also another related question, but it did not help.

The code is shown below. The entry point is process_image.

import requests
from pdf2image import convert_from_path

def process_image(event, context):
    # Download sample pdf file
    url = 'https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf'
    r = requests.get(url, allow_redirects=True)
    open('/tmp/sample.pdf', 'wb').write(r.content)

    # The error occurs on this line
    pages = convert_from_path('/tmp/sample.pdf')

    # Save each page to /tmp as a JPEG
    for idx, page in enumerate(pages):
        output_file_path = f"/tmp/{idx}.jpg"
        page.save(output_file_path, 'JPEG')
        # To be saved to Cloud Storage

requirements.txt:

requests==2.25.1
pdf2image==1.14.0

This is the error I get:

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 441, in pdfinfo_from_path
    proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
  File "/opt/python3.8/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/python3.8/lib/python3.8/subprocess.py", line 1706, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/functions_framework/__init__.py", line 149, in view_func
    function(data, context)
  File "/workspace/main.py", line 11, in process_image
    pages = convert_from_path('/tmp/sample.pdf')
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 97, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 467, in pdfinfo_from_path
    raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Thanks in advance for your help.

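This error occurs because the poppler package does not work in Cloud Functions: it needs to write some files to the system, and unfortunately you cannot write to the file system (other than /tmp) in serverless products such as Cloud Functions.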

You may want to try the approach described in another thread, "Cloud Functions for Firebase - Converting PDF to image", or consider using GCP Compute Engine, which gives you access to the whole system.

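Cloud Functions does not support installing custom system-level packages (although it does support third-party libraries for the relevant programming language through package managers such as npm and pip). As shown at https://cloud.google.com/functions/docs/reference/system-packages, there is no "poppler" package.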

However, you can still use the other preinstalled packages. Ghostscript can be used to convert a PDF to images.
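
Before relying on a particular tool, it can help to check which binaries are actually present in the runtime. The following is a minimal sketch and not part of the original answer; the helper name `find_tools` and the list of tool names are illustrative assumptions. It only uses `shutil.which` from the standard library to look each name up on PATH.

```python
import shutil


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # 'gs' is Ghostscript; 'pdfinfo' and 'pdftoppm' are poppler utilities.
    for name, path in find_tools(["gs", "pdfinfo", "pdftoppm"]).items():
        print(f"{name}: {path or 'not found'}")
```

Logging this output once from a deployed function should show whether `gs` is available and confirm that `pdfinfo` is missing.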

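First, you should save the PDF file inside the cloud function (for example, by downloading it from Cloud Storage). You only have disk write access to /tmp (https://cloud.google.com/functions/docs/concepts/exec#file_system).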
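
As a small illustration of keeping every file under the writable directory, here is a sketch of a helper that builds paths inside the temp directory. The name `tmp_path` is hypothetical, and the sketch assumes `tempfile.gettempdir()` resolves to /tmp in the runtime, which is the usual default on Linux when `TMPDIR` is not set.

```python
import os
import tempfile


def tmp_path(file_name):
    """Return a path for file_name inside the temp directory (usually /tmp on Linux)."""
    # Keep only the final path component so a name like '../x' cannot escape the temp dir.
    safe_name = os.path.basename(file_name)
    if safe_name in ("", ".", ".."):
        raise ValueError(f"not a usable file name: {file_name!r}")
    return os.path.join(tempfile.gettempdir(), safe_name)
```

For example, `tmp_path("sample.pdf")` could serve as the download target, and `tmp_path("0.jpg")` as the output path for the converted image.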

An example terminal command for converting a PDF to JPEG is:

gs -dSAFER -dNOPAUSE -dBATCH -sDEVICE=jpeg -dJPEGQ=100 -r300 -sOutputFile=output/file/path input/file/path

Example code that runs the command from Python:

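import logging
import os
import subprocess

from google.cloud import storage

# bucket_name, file_name, input_file_path and output_file_path are assumed to
# be defined by the surrounding function; both paths must be under /tmp.

# Download the file from Google Cloud Storage
gcs = storage.Client(project=os.environ['GCP_PROJECT'])
bucket = gcs.bucket(bucket_name)
blob = bucket.blob(file_name)
blob.download_to_filename(input_file_path)

# Run Ghostscript. Passing a list avoids literal quotes ending up in the
# output file name and keeps paths that contain spaces intact.
cmd = [
    'gs', '-dSAFER', '-dNOPAUSE', '-dBATCH', '-sDEVICE=jpeg', '-dJPEGQ=100',
    '-r300', f'-sOutputFile={output_file_path}', input_file_path,
]
p = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
stdout, stderr = p.communicate()
error = stderr.decode('utf8')
if p.returncode != 0 or error:
    logging.error(error)
    return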
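
For a multi-page PDF, Ghostscript writes one file per page when the output file name contains a `%d`-style placeholder. The sketch below wraps the command above in two hypothetical helpers, `build_gs_command` and `pdf_to_jpegs`; the names, the `page-%03d.jpg` naming scheme, and the default values are assumptions and not part of the original answer. It treats a non-zero exit code as a failure rather than inspecting stderr.

```python
import glob
import os
import subprocess


def build_gs_command(input_path, output_dir, dpi=300, quality=100):
    """Build the Ghostscript argument list; '%03d' becomes the page number."""
    output_pattern = os.path.join(output_dir, "page-%03d.jpg")
    return [
        "gs", "-dSAFER", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=jpeg", f"-dJPEGQ={quality}", f"-r{dpi}",
        f"-sOutputFile={output_pattern}",
        input_path,
    ]


def pdf_to_jpegs(input_path, output_dir):
    """Run Ghostscript and return the sorted list of JPEG files it produced."""
    os.makedirs(output_dir, exist_ok=True)
    # check=True raises CalledProcessError if gs exits with a non-zero status.
    subprocess.run(build_gs_command(input_path, output_dir),
                   check=True, capture_output=True)
    return sorted(glob.glob(os.path.join(output_dir, "page-*.jpg")))
```

A call such as `pdf_to_jpegs('/tmp/sample.pdf', '/tmp/pages')` would then return the per-page JPEG paths, ready to be uploaded to Cloud Storage.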

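Note: You may want to use the ImageMagick package instead, which itself uses Ghostscript. However, as of this writing (2021-07-12), ImageMagick's PDF reading has been disabled because of a security vulnerability in Ghostscript. The solution provided is essentially another way of running Ghostscript.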

Reference: https://www.the-swamp.info/blog/google-cloud-functions-system-packages/