How to write output of Apache Crunch to Amazon S3 bucket

Question

Is there a way to write Apache Crunch output to an S3 bucket? The write method on the Crunch pipeline takes a Target as a parameter. Is there a way to use S3 as the target for that write method?

Answer 1

Can't you just use the write method on your PCollection and give it your S3 location?

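import org.apache.crunch.PCollection;
import org.apache.crunch.io.To;

PCollection<String> items = ...;
// Write the collection as Avro files under the S3 prefix.
items.write(To.avroFile("s3://bucket/prefix"));
pipeline.done();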

That is basically what we do, but we run in EMR. To move data from on-premises clusters, we use the Hadoop distcp command.
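For the copy step, here is a minimal sketch of launching `hadoop distcp <source> <destination>` from Java. It assumes the `hadoop` executable is on the PATH and that the cluster already has an S3 connector and credentials configured. The class name `DistCpLauncher`, the HDFS path, and the `s3a://` scheme are illustrative assumptions, not part of the original answer; use whichever scheme your cluster's connector expects.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: wraps the `hadoop distcp` command line.
public class DistCpLauncher {

    // Builds the argument list for `hadoop distcp <source> <destination>`.
    static List<String> buildCommand(String source, String destination) {
        return Arrays.asList("hadoop", "distcp", source, destination);
    }

    // Runs the command, forwards its output to this process's console,
    // and returns the exit code of the hadoop process.
    static int run(String source, String destination)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(source, destination))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Both paths are placeholders (assumptions); replace with real locations.
        int exitCode = run("hdfs:///crunch/output", "s3a://bucket/prefix");
        System.exit(exitCode);
    }
}
```

Keeping the command construction in its own method makes it possible to check the arguments without a Hadoop installation; only `run` actually needs the cluster tools.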
