Is it possible to store processed files back where they were initially stored, using Google-provided utility templates?

One of the Google Dataflow utility templates lets us compress files in GCS (Bulk Compress Cloud Storage Files).

The input parameter can take a pattern that covers several folders (e.g. inputFilePattern=gs://YOUR_BUCKET_NAME/uncompressed/**.csv), but is it actually possible to store the compressed/processed files in the same folder where each one was initially stored?

If you look at the documentation:

The extensions appended will be one of: .bzip2, .deflate, .gz.

So the new compressed files will not match the pattern you provided (*.csv), and you can store them in the same folder without any conflict.
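As a quick sanity check of that point, here is a minimal sketch. It assumes the compressed name is simply the original name with one of the documented extensions appended, and it uses Python's `fnmatchcase` as a stand-in for the GCS pattern matching (the real matcher may differ in details such as `**`). The helper names are hypothetical.

```python
from fnmatch import fnmatchcase

# Extensions quoted from the template documentation.
COMPRESSION_EXTENSIONS = (".bzip2", ".deflate", ".gz")


def compressed_names(filename):
    """Names a compressed copy could get, assuming the extension is appended."""
    return [filename + ext for ext in COMPRESSION_EXTENSIONS]


def collides_with_pattern(filename, pattern="*.csv"):
    """True if any compressed copy would itself match the input pattern."""
    return any(fnmatchcase(name, pattern) for name in compressed_names(filename))


print(compressed_names("sales.csv"))
print(collides_with_pattern("sales.csv"))
```

None of the compressed names end in `.csv`, so a later run with the same pattern would not pick them up as input.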

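Also, this is a batch process. If you look closely at the Dataflow IO components, in particular reading from GCS with a pattern, the list of files to compress is read when the job starts, so it does not change while the job is running.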

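So if new files matching the pattern arrive while the job is running, the current job will not take them into account. You will have to run another job to pick up those new files.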
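To see which files a follow-up job would need to cover, you could compare the listing taken when the job started with the current one. This is only a sketch over plain lists of object names; how you obtain the listings is up to you, and `missed_by_job` is a hypothetical helper, not part of the template.

```python
from fnmatch import fnmatchcase


def missed_by_job(snapshot_at_start, listing_now, pattern="*.csv"):
    """Files that match the pattern now but were not present at job start."""
    seen = set(snapshot_at_start)
    return sorted(
        name
        for name in listing_now
        if fnmatchcase(name, pattern) and name not in seen
    )


at_start = ["a.csv", "b.csv"]
now = ["a.csv", "a.csv.gz", "b.csv", "b.csv.gz", "c.csv"]
print(missed_by_job(at_start, now))
```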

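One last thing: the existing uncompressed files are not replaced by the compressed ones. You will end up with every file twice, once compressed and once uncompressed. To save space (and money), I recommend deleting one of the two versions.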
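For the cleanup, a minimal sketch of how the duplicates could be identified, assuming you already have a list of object names from the folder and that compressed copies are named by appending one of the documented extensions. `redundant_originals` is a hypothetical helper; the actual deletion is left to whatever tooling you use.

```python
# Extensions quoted from the template documentation.
COMPRESSION_EXTENSIONS = (".bzip2", ".deflate", ".gz")


def redundant_originals(object_names):
    """Uncompressed files that already have a compressed copy next to them."""
    names = set(object_names)
    return sorted(
        name
        for name in names
        if any(name + ext in names for ext in COMPRESSION_EXTENSIONS)
    )


listing = ["a/x.csv", "a/x.csv.gz", "b/y.csv"]
print(redundant_originals(listing))
```

Only files whose compressed copy actually exists are reported, so a file that the job has not processed yet (here `b/y.csv`) is left alone.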

dataflow

google-cloud-platform