TextIO.Read().From() vs TextIO.ReadFiles() over withHintMatchesManyFiles()

In my use case, I receive a set of file patterns to match from Kafka:

PCollection<String> filepatterns = p.apply(KafkaIO.read()...);

Each pattern here can match 300 or more files.

Q1. How can I use TextIO.Read() to match patterns that come from a PCollection, given that withHintMatchesManyFiles() is available only on TextIO.Read() and not on TextIO.ReadFiles()?

Q2. If I take the FileIO.Match -> FileIO.ReadMatch() -> TextIO.ReadFiles() route, withHintMatchesManyFiles() is not available on that route. How does this affect read performance?

Q3. Is there any other optimized path for the use case above?

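My overall understanding of Apache Beam is quite limited, especially of PTransforms. As far as I understand, TextIO.read() creates a root PTransform that can only be used at the very start of a pipeline. In other words, TextIO.Read cannot be applied after any other PTransform.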

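Yes, you cannot use withHintMatchesManyFiles() directly with TextIO.ReadFiles(). In fact, TextIO.Read().withHintMatchesManyFiles() is itself implemented with FileIO transforms + TextIO.ReadFiles() (see details). With that chain, FileIO.readMatches() should distribute the files to be read across the workers.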

So I think you can use the same approach when reading the file names from a Kafka topic.
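As a rough illustration of that approach, here is a minimal sketch of a composite transform that takes a PCollection of file patterns and returns the lines of every matched file. The class name `ReadLinesFromPatterns` and the step names are made up for this example, and it assumes the Kafka records have already been turned into a `PCollection<String>` of patterns, as in the question. It has not been tuned or benchmarked for the 300+ files per pattern case.

```java
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;

/**
 * Hypothetical helper (not part of Beam): expands each incoming file pattern
 * and reads the matched files line by line.
 */
public class ReadLinesFromPatterns
    extends PTransform<PCollection<String>, PCollection<String>> {

  @Override
  public PCollection<String> expand(PCollection<String> filepatterns) {
    return filepatterns
        // Expand every pattern element into metadata for the files it matches.
        .apply("MatchPatterns", FileIO.matchAll())
        // Turn the match metadata into readable file handles.
        .apply("OpenMatches", FileIO.readMatches())
        // Read each file as text, emitting one element per line.
        .apply("ReadLines", TextIO.readFiles());
  }
}
```

With this in place, the pipeline from the question would continue with something like `filepatterns.apply(new ReadLinesFromPatterns())`. Unlike TextIO.read(), this chain does not need to sit at the root of the pipeline, because each step consumes a PCollection.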

apache-beam

apache-beam-io

apache-beam-internals