等价于 apache beam 中的重新分区

Question

在spark中，如果我们必须重新洗牌数据，我们可以使用数据帧的重新分区。在 apache beam 中为 pcollection 执行相同操作的方法是什么？

在 pyspark 中，

new_df = df.repartition(4)

Answer 1

从这个doc:

You can insert a Reshuffle step. Reshuffle prevents fusion, checkpoints the data, and performs deduplication of records. Reshuffle is supported by Dataflow even though it is marked deprecated in the Apache Beam documentation.

虽然我不确定 Reshuffle 是否会得到 Beam 的其他运行者的支持。

javadoc and further explanation of Reshuffle:

等价于 apache beam 中的重新分区

Equivalent of repartition in apache beam

apache-spark

google-cloud-dataflow

pyspark

apache-beam

apache-beam-internals