How to manually copy executable to workers with Apache Beam Dataflow on GCP

I'm fairly new to Beam and GCP. Following this document and using the Beam 'subprocess' examples, I've been working on a simple Java pipeline that runs a C binary. It runs fine with the directRunner, and I'm now trying to get it running in the cloud. After staging the file into a gs bucket, I get the error: 'Cannot run program "gs://mybucketname/tmp/grid_working_files/Echo": error=2, No such file or directory', which makes sense, since I'm guessing you can't execute directly out of Cloud Storage? The problem I have now is how to move the executable onto the workers. The document states:

When you use a native Apache Beam language (Java or Python), the Beam SDK automatically moves all required code to the workers. However, when you make a call to external code, you need to move the code manually.  To move the code, you do the following:

  1. Store the compiled external code, along with versioning information, in Cloud Storage.
  2. In the @Setup method, create a synchronized block to check whether the code file is available on the local resource. Rather than implementing a physical check, you can confirm availability using a static variable when the first thread finishes.
  3. If the file isn't available, use the Cloud Storage client library to pull the file from the Cloud Storage bucket to the local worker. A recommended approach is to use the Beam FileSystems class for this task.
  4. After the file is moved, confirm that the execute bit is set on the code file.
  5. In a production system, check the hash of the binaries to ensure that the file has been copied correctly.

I've looked at the FileSystems class, and I think I understand it, but what I don't know is where I need to copy the file to. Is there a known directory or file path that the workers use? I'm using the Dataflow runner.

You can copy the file to any location in the worker's local filesystem. For example, you could use the tempfile module to create a new, empty temporary directory to copy your executable into before it runs.
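
In Java, `Files.createTempDirectory` does the same job as Python's tempfile. Here is a minimal sketch of the steps from the document you quoted, assuming a hypothetical `EchoFn` DoFn and reusing the bucket path from your error message (the names and paths are illustrative, not prescriptive):

```java
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.transforms.DoFn;

public class EchoFn extends DoFn<String, String> {
  private static final String GCS_BINARY = "gs://mybucketname/tmp/grid_working_files/Echo";

  // Shared across all threads in the worker JVM so the copy happens only once.
  private static volatile Path localBinary = null;

  @Setup
  public void setup() throws IOException {
    synchronized (EchoFn.class) {
      if (localBinary != null) {
        return; // another thread already staged the binary
      }
      // Any local path works; an empty temp directory keeps things tidy.
      Path dir = Files.createTempDirectory("grid_working_files");
      Path target = dir.resolve("Echo");
      // Pull the file from the bucket using Beam's FileSystems class.
      try (ReadableByteChannel in =
          FileSystems.open(FileSystems.matchSingleFileSpec(GCS_BINARY).resourceId())) {
        Files.copy(Channels.newInputStream(in), target, StandardCopyOption.REPLACE_EXISTING);
      }
      // Set the execute bit so the worker can actually run it.
      target.toFile().setExecutable(true);
      // In a production system, verify a hash of the copied file here.
      localBinary = target;
    }
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws IOException, InterruptedException {
    Process p = new ProcessBuilder(localBinary.toString(), c.element()).start();
    p.waitFor();
    c.output(c.element());
  }
}
```

The static field and synchronized block mirror step 2 of the quoted document: the download runs once per worker JVM rather than once per DoFn instance.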

Using custom containers might also be a good solution.
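
If you go that route, the idea is to bake the binary into the Beam SDK container image so it is already present on every worker. A sketch, assuming a Java 11 pipeline (the image tag and install path are placeholders):

```dockerfile
# Base image should match your pipeline's SDK language and version (placeholder tag).
FROM apache/beam_java11_sdk:2.46.0

# Bake the compiled binary into the image at a fixed path and make it executable.
COPY Echo /usr/local/bin/Echo
RUN chmod +x /usr/local/bin/Echo
```

You would then push the image to Container Registry or Artifact Registry and point Dataflow at it with the `--sdkContainerImage` pipeline option; the DoFn can call `/usr/local/bin/Echo` directly, with no copying at @Setup time.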