Get the max value over the window in pyspark

Question

I am computing the maximum value over a specific window in pyspark, but the result returned by the method is not what I expected.

Here is my code:

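from pandas import DataFrame
from pyspark.sql import Window, functions as F

# `spark` and `display` are assumed to be provided by the notebook environment
test = spark.createDataFrame(DataFrame({'grp': ['a', 'a', 'b', 'b'], 'val': [2, 3, 3, 4]}))
win = Window.partitionBy('grp').orderBy('val')
test = test.withColumn('row_number', F.row_number().over(win))
# max over the same ordered window; this line produces the unexpected result
test = test.withColumn('max_row_number', F.max('row_number').over(win))
display(test)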

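The output is:

I expected max_row_number to be 2 for every row in both group 'a' and group 'b', but that is not the case.

Does anyone have an idea what is going on? Thanks a lot!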

Answer 1

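The problem here is the frame that the max function is evaluated over. If you order the window, as you do, the default frame is Window.unboundedPreceding to Window.currentRow, so max only sees the rows up to the current one and simply equals row_number. You can define a second window in which you drop the ordering (the max function does not need it):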

w2 = Window.partitionBy('grp')

The PySpark docs describe these default frames:

Note When ordering is not defined, an unbounded window frame (rowFrame, unboundedPreceding, unboundedFollowing) is used by default. When ordering is defined, a growing window frame (rangeFrame, unboundedPreceding, currentRow) is used by default.
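To make the note concrete, here is a plain-Python sketch (no Spark involved) that emulates the two default frames on the question's data. The helper `max_row_number` is hypothetical and only imitates the documented behavior: with ordering, the frame holds every row whose ordering value is less than or equal to the current one; without ordering, it holds the whole partition.

```python
from itertools import groupby

# (grp, val) pairs, the same data as in the question.
rows = [("a", 2), ("a", 3), ("b", 3), ("b", 4)]

def max_row_number(rows, ordered):
    """Emulate max(row_number) per row under the two default window frames."""
    out = []
    for _, part in groupby(sorted(rows), key=lambda r: r[0]):
        # Number the rows of each partition in order of val, starting at 1.
        numbered = [(g, v, i + 1) for i, (g, v) in enumerate(part)]
        for g, v, rn in numbered:
            if ordered:
                # Growing range frame: unboundedPreceding up to the current row.
                frame = [n for (_, v2, n) in numbered if v2 <= v]
            else:
                # Unbounded frame: every row of the partition.
                frame = [n for (_, _, n) in numbered]
            out.append((g, v, rn, max(frame)))
    return out

with_order = max_row_number(rows, ordered=True)
without_order = max_row_number(rows, ordered=False)
print(with_order)
print(without_order)
```

With the growing frame the maximum just tracks row_number (1, 2 within each group), which matches the behavior the question describes; with the unbounded frame every row gets 2.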

apache-spark

apache-spark-sql

pyspark

pyspark-dataframes