Pipeline inputs 8 billion lines from GCS and does a GroupByKey to prevent fusion, group step running very slow

I read 8 billion lines from GCS, process each line, and then write the output. My processing step can take a little while, and to avoid the worker lease expiring and getting the error below, I do a GroupByKey on the 8 billion lines, grouping by id, to prevent fusion.

A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
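
For reference, here is a minimal sketch of the kind of pipeline described above, written with the Beam Java SDK (Java is assumed because jmap is mentioned in the answer below). The bucket paths, the id-extraction logic, and `expensiveTransform` are placeholders, not the actual code from the question.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.KV;

public class GroupToBreakFusion {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input/*"))   // ~8 billion lines
     // Key each line by its id so the following GroupByKey acts as a fusion break.
     .apply("KeyById", ParDo.of(new DoFn<String, KV<String, String>>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         String line = c.element();
         String id = line.split(",")[0];   // hypothetical: id is the first field
         c.output(KV.of(id, line));
       }
     }))
     .apply("BreakFusion", GroupByKey.<String, String>create())
     // Ungroup back to individual lines before the slow per-line processing.
     .apply("UngroupValues", Values.<Iterable<String>>create())
     .apply("FlattenGroups", Flatten.<String>iterables())
     .apply("SlowProcessing", ParDo.of(new DoFn<String, String>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         c.output(expensiveTransform(c.element()));   // hypothetical processing step
       }
     }))
     .apply("WriteOutput", TextIO.write().to("gs://my-bucket/output/part"));

    p.run();
  }

  // Placeholder for the per-line processing described in the question.
  static String expensiveTransform(String line) {
    return line;
  }
}
```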

The problem is that even on 1000 high-mem-2 nodes, the GroupByKey step is taking forever to get through the 8 billion lines.

I looked into possible causes of the slowness, such as the total size of the values GroupByKey produces for a single key being very large. I don't think that can be the case here, because across the 8 billion inputs a single input id cannot appear more than 30 times in the collection. So the problem is clearly not hot keys; something else is going on.
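
If it helps, one way to sanity-check that claim about the key distribution is a small side branch that counts occurrences per key and writes out any id that shows up more often than expected. This is only a sketch: it assumes `keyedLines` is the keyed `PCollection<KV<String, String>>` from a step like `KeyById` above, and the report path is a placeholder.

```java
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

/** Writes any id that appears more often than expected, to rule hot keys in or out. */
static void writeHotKeyReport(PCollection<KV<String, String>> keyedLines) {
  keyedLines
      .apply("CountPerKey", Count.perKey())
      // The question states a single id should appear at most ~30 times.
      .apply("KeepHotKeys", Filter.by(kv -> kv.getValue() > 30))
      .apply("FormatReport", MapElements
          .into(TypeDescriptors.strings())
          .via(kv -> kv.getKey() + "," + kv.getValue()))
      .apply("WriteReport", TextIO.write().to("gs://my-bucket/hot-key-report"));
}
```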

Any ideas on how to optimize this are welcome. Thanks.

I did manage to fix the issue. I had some wrong assumptions here. I was looking at my pipeline and assumed that the step with the highest wall time, which was measured in days, was the bottleneck. But in Apache Beam a step is usually fused together with the steps downstream of it in the pipeline, and will only run as fast as those downstream steps run. So a large wall time alone is not enough to conclude that a step is the bottleneck in the pipeline.

The real solution to the problem stated above came from reducing the number of nodes the pipeline runs on and changing the node type from high-mem-2 to high-mem-4. I wish there were an easy way to get memory usage metrics for a Dataflow pipeline; I had to SSH into the VMs and run jmap.
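
A sketch of how those settings might translate into Dataflow pipeline options in the Java SDK. The worker counts below are illustrative, not the real values used in the answer.

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Where the pipeline options are created (e.g. in main(String[] args)):
DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
options.setRunner(DataflowRunner.class);

// Fewer, larger workers: the change that fixed the problem in the answer above.
options.setWorkerMachineType("n1-highmem-4");   // was n1-highmem-2
options.setNumWorkers(250);                      // reduced from 1000 (illustrative number)
options.setMaxNumWorkers(250);

Pipeline p = Pipeline.create(options);

// Memory usage had to be checked manually, by SSHing into a worker VM and
// running something like `jmap -histo <java-pid>` against the worker process.
```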