How to manually copy an executable to workers with Apache Beam Dataflow on GCP
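I'm fairly new to Beam and GCP. Following this document and using the Beam 'subprocess' examples, I've been working on a simple Java pipeline that runs a C binary. It runs fine with the DirectRunner, and I'm now trying to get it running in the cloud. With the file staged in a gs bucket, I get the error: 'Cannot run program "gs://mybucketname/tmp/grid_working_files/Echo": error=2, No such file or directory'. That makes sense, since I assume you can't execute a file directly from Cloud Storage? Where I'm stuck now is how to move the executable to the worker. The document states: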
When you use a native Apache Beam language (Java or Python), the Beam SDK automatically moves all required code to the workers. However, when you make a call to external code, you need to move the code manually.

To move the code, you do the following:
- Store the compiled external code, along with versioning information, in Cloud Storage.
- In the @Setup method, create a synchronized block to check whether the code file is available on the local resource. Rather than implementing a physical check, you can confirm availability using a static variable when the first thread finishes.
- If the file isn't available, use the Cloud Storage client library to pull the file from the Cloud Storage bucket to the local worker. A recommended approach is to use the Beam FileSystems class for this task.
- After the file is moved, confirm that the execute bit is set on the code file.
- In a production system, check the hash of the binaries to ensure that the file has been copied correctly.
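I've looked at the FileSystems class and I think I understand it, but I can't figure out where I need to copy the file to. Is there a known directory or file path that the workers use? I'm using the Dataflow runner.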
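You can copy the file to any location on the worker's local filesystem. For example, you could use the tempfile module to create a new, empty temporary directory and copy your executable into it before running it.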
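Since the pipeline in the question is Java, here is a minimal sketch of that idea using `Files.createTempDirectory` in place of Python's tempfile. It follows the steps quoted in the question: a synchronized check backed by static state, a copy to local disk, an optional hash check, and setting the execute bit. The class name `ExecutableStager`, the `Source` interface, and the `stage` method are all hypothetical names made up for illustration. The sketch deliberately uses only the JDK, so the Cloud Storage read is left to the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: stages an executable on the worker's local disk once per JVM. */
final class ExecutableStager {
    // Static, so it is shared by every DoFn instance running in this worker JVM.
    private static final Map<String, Path> STAGED = new HashMap<>();

    /** Hypothetical hook that opens the remote file (e.g. via Beam FileSystems). */
    public interface Source {
        InputStream open() throws IOException;
    }

    /**
     * Copies the executable into a fresh temp directory the first time it is
     * requested; later calls return the already-staged path without copying.
     * Pass null for expectedSha256 to skip the hash check.
     */
    public static synchronized Path stage(String fileName, Source source, String expectedSha256)
            throws IOException {
        Path existing = STAGED.get(fileName);
        if (existing != null) {
            return existing; // an earlier thread already staged it
        }
        Path dir = Files.createTempDirectory("beam-exec-");
        Path target = dir.resolve(fileName);
        try (InputStream in = source.open()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        if (expectedSha256 != null && !expectedSha256.equalsIgnoreCase(sha256Hex(target))) {
            throw new IOException("Hash mismatch for " + target);
        }
        if (!target.toFile().setExecutable(true)) {
            throw new IOException("Could not set execute bit on " + target);
        }
        STAGED.put(fileName, target);
        return target;
    }

    /** Returns the lowercase hex SHA-256 digest of the file's contents. */
    public static String sha256Hex(Path file) throws IOException {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(Files.readAllBytes(file))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

In a DoFn's `@Setup` method, the `Source` could plausibly be `() -> Channels.newInputStream(FileSystems.open(FileSystems.matchNewResource("gs://mybucketname/tmp/grid_working_files/Echo", false)))`, using Beam's `FileSystems` class as the quoted document recommends. The returned local path, rather than the `gs://` URL, is then what gets passed to `ProcessBuilder`.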
Using custom containers might also be a good solution.