How to add an external library to a Glue job using Python shell

I am trying to run a Glue job in python-shell with external dependencies (such as pyathena, pytest, etc.) added as Python egg/whl files in the job configuration, as described in the AWS documentation: https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html.

The Glue job is configured in a VPC with no internet access, and running it results in the following errors.

WARNING: The directory '/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.

WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fd05d6a4f28>, 'Connection to pypi.org timed out. (connect timeout=15)')'

I even tried modifying my Python script with the following code:

import os
import site
import importlib
from setuptools.command import easy_install
install_path = os.environ['GLUE_INSTALLATION']

libraries = ["pyathena"]

for lib in libraries:
    easy_install.main(["--install-dir", install_path, lib])

importlib.reload(site)

When I run the code above, I get the following error:

Download error on https://pypi.org/simple/pyathena/: [Errno 99] Cannot assign requested address -- Some packages may not be found! Couldn't find index page for 'pyathena' (maybe misspelled?)

Could someone provide a sample code snippet for generating egg/whl files for external Python packages and adding them to a Glue python-shell job?

Refer to this doc, which has detailed steps for packaging a Python library. Also make sure that your VPC has an S3 endpoint, because when you run a Glue job inside a VPC the traffic does not leave the AWS network.
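
Since the job cannot reach pypi.org, one possible approach is to fetch the wheels on a machine that does have internet access, upload them to S3, and point the job at them. The following is only a rough sketch under several assumptions: the bucket name `my-glue-libs`, the prefix `wheels`, and all function names are hypothetical, boto3 is assumed to be installed on the build machine, and the build machine's Python version and platform are assumed to be compatible with the Glue python-shell runtime. It uses `pip download` so that the dependencies of pyathena are fetched as well, because the job cannot resolve them at run time.

```python
import subprocess
import sys
from pathlib import Path


def pip_download_command(packages, dest_dir):
    """Build a `pip download` command that fetches wheels (no sdists)."""
    return [
        sys.executable, "-m", "pip", "download",
        "--only-binary=:all:",      # wheels only, so nothing has to be built
        "--dest", str(dest_dir),    # local directory for the .whl files
        *packages,
    ]


def extra_py_files(bucket, prefix, wheel_names):
    """Return a comma-separated list of S3 URIs for the given wheel files."""
    prefix = prefix.strip("/")
    uris = []
    for name in sorted(wheel_names):
        key = "/".join(part for part in (prefix, name) if part)
        uris.append(f"s3://{bucket}/{key}")
    return ",".join(uris)


def download_and_upload(packages, bucket, prefix, dest_dir="wheels"):
    """Run on a machine WITH internet access; needs AWS credentials and boto3."""
    import boto3  # assumption: installed on the build machine

    dest = Path(dest_dir)
    dest.mkdir(exist_ok=True)
    subprocess.run(pip_download_command(packages, dest), check=True)

    s3 = boto3.client("s3")
    names = [p.name for p in dest.glob("*.whl")]
    for name in names:
        s3.upload_file(str(dest / name), bucket, f"{prefix.strip('/')}/{name}")
    return extra_py_files(bucket, prefix, names)


# Example call (hypothetical bucket and prefix), not executed here:
# print(download_and_upload(["pyathena"], "my-glue-libs", "wheels"))
```

The string returned by `download_and_upload` is meant to be pasted into the job's Python library path (the `--extra-py-files` job parameter). With the S3 endpoint in place, the job should then be able to read the wheels from S3 without contacting pypi.org.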