Why does using "--requirements_file" upload dependencies to GCS?

Question

I'm currently generating a template with these parameters:

        --runner DataflowRunner \
        --requirements_file requirements.txt \
        --project ${GOOGLE_PROJECT_ID} \
        --output ${GENERATED_FILES_PATH}/staging \
        --staging_location=${GENERATED_FILES_PATH}/staging \
        --temp_location=${GENERATED_FILES_PATH}/temp \
        --template_location=${GENERATED_FILES_PATH}/templates/calculation-template \

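The SDK then uploads the dependencies listed in requirements.txt to GCS, under the staging location. I don't understand this. I assumed that using this kind of file would let the workers pull the dependencies directly from the official pip registry (PyPI) rather than from my GCS bucket, right?

It makes the command take a long time to run, because it has to upload the packages :/

Is there an explanation for why this happens? Maybe I'm doing something wrong?

Thanks,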

Answer 1

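I believe this is done to make the Dataflow worker startup process more efficient and consistent, both at initial startup and when autoscaling. Otherwise, every time a Dataflow worker started, it would have to connect directly to PyPI and look up the latest matching version of each dependency. Instead, a single set of dependencies is staged when the pipeline is launched, and that same set is installed on the workers throughout the pipeline's execution.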
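To make the "resolved once at launch" idea concrete, here is a minimal sketch of the kind of command that could run on the submitting machine before anything is uploaded. It is only an approximation for illustration, not the SDK's actual code: the function name `build_download_command` and the cache path are hypothetical, and the exact flags the SDK passes to pip may differ between Beam versions.

```python
import sys


def build_download_command(requirements_file, cache_dir):
    """Build a `pip download` command that fetches the packages listed in
    requirements_file into a local directory, without installing them.

    Hypothetical helper for illustration only; it approximates the
    download step that happens once, at pipeline submission time.
    """
    return [
        sys.executable, "-m", "pip", "download",
        "--dest", cache_dir,        # local folder that receives the package files
        "-r", requirements_file,    # the file passed via --requirements_file
        "--exists-action", "i",     # ignore files that are already present
        "--no-deps",                # fetch only what the file lists explicitly
    ]


# Assumed example paths; nothing is executed here, the command is only printed.
cmd = build_download_command("requirements.txt", "/tmp/requirements-cache")
print(" ".join(cmd))
```

The files collected this way are what would then be copied to the staging location, so every worker installs from the same fixed set instead of resolving versions against PyPI on its own. That also accounts for the slow template generation: the download and upload cost is paid once up front rather than on each worker start.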

python

dataflow

google-cloud-dataflow

apache-beam