Beam SQL / Apache Beam 在运行多个连接时变慢

Question

虽然使用 Beam SQL 在 2 tables 上进行连接，但它可以正常工作并提供预期的性能，但是随着我的连接 Tables 的增加，性能变得最差。

下面是我的代码片段，可以帮助您调试我在 Beam 中的加入条件 SQL 以获得更好的性能。

PCollection<Row> outputStream2 = PCollectionTuple.of(new TupleTag<>("corporation1"), sourceData)
                .and(new TupleTag<>("dim"), dimtable).and(new TupleTag<>("place"), placeData)
                .and(new TupleTag<>("principle"), principle).apply(SqlTransform.query(
                        "Select d.merchant,d.corporation1,d.place,d.principal,c.corporation1_sk,r.place_sk,p.principal_sk FROM dim d LEFT JOIN corporation1 c ON c.corporation1 = d.corporation1 LEFT JOIN place p ON p.place = d.place and c.corporation1 = p.corporation1 "));

我可以在 Beam SQL/ Apache Beam 上进行连接的任何更好的方法，因为 Table、

中的顺序连接

前一个输出负责下一个 table 连接。我也尝试过使用 Co-GroupBy 和 SideInput 混合方法，其中 Table 中的数据低于 5K 我采用了 SideInput，而数据高于 50K 时使用 Co-GroupBy 进行连接，但性能不达标。

Answer 1

您似乎遇到了与 this 问题类似的问题，该问题目前没有修复的预计到达时间。 Beam SQL 本身目前并没有做很多 JOIN 优化，它根据接收到的输入类型选择最合适的方法（side-input，CoGBK），但仅此而已，你不能否则控制它。

在不知道您的具体设置的情况下很难确定，例如你有什么样的数据源，或者你如何确保使用侧输入与 CoGBK，或者你使用什么运行器，或者你期望的性能与你实际观察到的性能。

Beam SQL / Apache Beam 在运行多个连接时变慢

Beam SQL / Apache Beam is Slower when Running Multiple Joins

google-cloud-platform

google-cloud-dataflow

apache-beam

beam-sql

Beam SQL / Apache Beam 在 运行 多个连接时变慢

Beam SQL / Apache Beam is Slower when Running Multiple Joins

google-cloud-platform

google-cloud-dataflow

apache-beam

beam-sql

Beam SQL / Apache Beam 在运行多个连接时变慢