使用 SideInput Apache Beam 的 Co-Group 加入解决方案

Joining Solution using Co-Group by SideInput Apache Beam

我有 2 Table 人要加入，这是 Left Join。下面是两个条件，我的管道是如何工作的。

作业运行处于批处理模式及其所有用户数据，我们希望在 Google 数据流中处理。

第 1 天：

Table答：5000000条记录。（大小 3TB）

Table乙：200条记录。（大小 1GB）

两个 Tables 通过 SideInput 加入，其中 TableB 数据被视为 SideInput 并且工作正常。

第 2 天：

Table答：5000010条记录。（大小 3.001TB）

Table乙：20000条记录。（大小 100GB）

第二天，我的管道变慢了，因为 SideInput 使用缓存并且我的缓存大小耗尽，因为 TableB 的大小增加了。

所以我尝试使用 Co-Group by，但是 Day 1 数据处理非常慢，日志：Having 10000 plus values on Single Key.

那么在引入热键时是否有更好的性能方法来执行加入。

一旦table B 不再适合缓存，性能确实会急剧下降，而且没有太多好的解决方案。使用 CoGroupByKey 的速度变慢不仅是因为单个键上有很多值，还因为你现在正在洗牌（又名分组）Table A（这在使用侧面输入时避免了） .

根据您的密钥分布，一种可能的缓解措施是将您的热键处理成一条路径，该路径像以前一样进行侧输入连接，并将您的长尾密钥处理成 GoGBK。这可以通过生成一个截断的 TableB' 作为辅助输入来完成，如果在 TableB' 中找到它，你的 ParDo 将尝试查找发送到一个 PCollection 的密钥，如果它在另一个不是 [1]。然后将第二个 PCollection 传递给具有所有 TableB 的 CoGroupByKey，并将结果展平。

[1] https://beam.apache.org/documentation/programming-guide/#additional-outputs

使用 SideInput Apache Beam 的 Co-Group 加入解决方案

Joining Solution using Co-Group by SideInput Apache Beam

google-cloud-platform

google-cloud-dataflow

apache-beam