Performance issue while reading files from GCS using Apache Beam

Question

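I am trying to read data using a wildcard in a GCS path. My files are in bzip2 format, and about 300k files in the GCS path match the same wildcard expression. I am using the code snippet below to read the files.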

    PCollection<String> val = p
            .apply(FileIO.match()
                    .filepattern("gcsPath"))
            .apply(FileIO.readMatches().withCompression(Compression.BZIP2))
            .apply(MapElements.into(TypeDescriptor.of(String.class)).via((ReadableFile f) -> {
                try {
                    return f.readFullyAsUTF8String();
                } catch (IOException e) {
                    return null;
                }
            }));

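But the performance is very bad: at the current rate, reading the files with the code above would take about 3 days. Is there an alternative API I can use in Cloud Dataflow to read this many files from GCS with good performance? I used TextIO earlier, but it failed because of the 20 MB template serialization limit.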

Answer 1

The TextIO code below solved the problem.

    PCollection<String> input = p.apply("Read file from GCS",
            TextIO.read()
                    .from(options.getInputFile())
                    .withCompression(Compression.AUTO)
                    .withHintMatchesManyFiles());

Adding withHintMatchesManyFiles() is what fixed it. I still don't know why the FileIO version performs so badly, though.
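For comparison, TextIO can also be applied on top of the FileIO matching steps through TextIO.readFiles(). The sketch below is only an illustration and has not been run against 300k files: the class name and the bucket path are hypothetical, and whether it performs as well on Dataflow as the withHintMatchesManyFiles() version is an open assumption, not something established above. Note that it emits one element per line, whereas readFullyAsUTF8String() in the question emits one element per file.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical class name, used only for this sketch.
public class ReadManyBzip2Files {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical pattern; substitute the real bucket and wildcard.
    String pattern = "gs://example-bucket/input/*.bz2";

    PCollection<String> lines = p
        // Expand the wildcard into one metadata record per matching file.
        .apply("Match files", FileIO.match().filepattern(pattern))
        // Turn each match into a readable handle that decompresses bzip2.
        .apply("Open files", FileIO.readMatches().withCompression(Compression.BZIP2))
        // Read each file line by line instead of as one large String.
        .apply("Read lines", TextIO.readFiles());

    p.run().waitUntilFinish();
  }
}
```

Reading line by line avoids holding an entire decompressed file in memory as a single element, which may matter if individual files are large.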

google-cloud-platform

google-cloud-dataflow

apache-beam

apache-beam-io