How to implement groupby(column1,column2) in Apache Beam

Question

I need help writing Beam code in Python that is equivalent to the following Spark SQL code.

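from pyspark.sql.functions import count

# Count rows per (State, Color) pair and sort by that count, largest first.
count_mnm_df = (mnm_df
     .select("State", "Color", "Count")
     .groupBy("State", "Color")
     .agg(count("Count").alias("Total"))
     .orderBy("Total", ascending=False))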

Answer 1

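The most direct mapping of the above is probably Beam SQL. The documentation for the corresponding Python transform also contains information on usage. Note that support in the Python SDK is provided through Beam's cross-language transforms support, which is relatively new.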

You could also consider writing a Beam pipeline that performs the same computation using the available Beam transforms.

Note that Beam does not guarantee the order of elements in a PCollection.

google-cloud-dataflow

apache-beam