Apache Beam: Can step B depend on step A without passing outputs of A into B?

Question

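Is there a way to make PTransform B depend on PTransform A when A produces no output? Or do I have to make A produce a dummy output and feed it into B as a side input? An example use case is a pipeline like this: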

Z = read file
A = count lines in file, and throw error if there are no lines
B = do something with the file

I want B to start only after A has finished, but A does not produce any output PCollection that is useful to B.

Answer 1

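This is possible, but it is probably not ideal in your case. Adding such a dependency reduces how much of the program can run in parallel, because B has to wait for A to finish before it can start.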

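If you really want to do this, the approach you describe should work: have A output a single element and use it as a side input to B. Consider the following alternative, which lets you implement A with the stock Count transform and moves all of the logic into one place: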

Z = read file
A = count lines in file
B = side input from A, throw error if the count of lines was zero,
    otherwise do something with the file
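
As a rough illustration, the outline above might look something like this in the Beam Python SDK. This is only a sketch under assumptions: `require_nonempty`, `process_line`, `run`, and the `path` argument are hypothetical names, and `line.upper()` is a stand-in for "do something with the file". One detail differs from the outline: the zero check is applied to the count itself rather than inside the per-line function, because a per-line function is never called when the file has no lines.

```python
def require_nonempty(line_count):
    """Step A's check: fail if the file had no lines, else pass the count on."""
    if line_count == 0:
        raise ValueError("input file has no lines")
    return line_count


def process_line(line, line_count):
    """Step B (placeholder logic). line_count is only here to create the dependency."""
    return line.upper()


def run(path):
    # Imported here so the two helpers above stay plain Python functions.
    import apache_beam as beam

    with beam.Pipeline() as p:
        # Z: read the file as a PCollection of lines.
        lines = p | "Z: read" >> beam.io.ReadFromText(path)

        # A: count the lines, then validate the single resulting element.
        checked = (
            lines
            | "A: count" >> beam.combiners.Count.Globally()
            | "A: check" >> beam.Map(require_nonempty)
        )

        # B: consumes the lines, with A's result as a singleton side input,
        # so B cannot start until A's output is available.
        _ = (
            lines
            | "B: process" >> beam.Map(
                process_line, line_count=beam.pvalue.AsSingleton(checked))
            | "print" >> beam.Map(print)
        )
```

Calling `run("input.txt")` (a hypothetical file name) would build and run the pipeline on the default runner. The side input is what makes B wait for A in a batch pipeline; the value itself is ignored by `process_line`.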

google-cloud-dataflow

apache-beam