Pyspark: not cumulative sum over partition

I want to compute a sum over a partition, but not as a cumulative sum; rather, the total for each partition:

From:

Category A  Category B  Value
         1           2    100
         1           2    150
         2           1    110
         2           2    200

I want:

Category A  Category B  Value  Sum
         1           2    100  250
         1           2    150  250
         2           1    110  110
         2           2    200  200

With:

from pyspark.sql.functions import sum
from pyspark.sql.window import Window
windowSpec = Window.partitionBy(["Category A","Category B"])
df = df.withColumn('sum', sum(df.Value).over(windowSpec))

I am not getting the result I want; instead I get a cumulative sum:

Category A  Category B  Value  Sum
         1           2    100  100
         1           2    150  250
         2           1    110  110
         2           2    200  200

How should I proceed? Thanks

When you define the window, you can specify a range for it.

You can specify the range (Window.unboundedPreceding, Window.unboundedFollowing) to sum over all rows in each partition, regardless of the ordering of the rows:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

windowSpec = Window.partitionBy(["Category A", "Category B"])\
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('sum', F.sum(df.Value).over(windowSpec))\
    .orderBy("Category A", "Category B").show()

which prints

+----------+----------+-----+-----+
|Category A|Category B|Value|  sum|
+----------+----------+-----+-----+
|         1|         2|  100|250.0|
|         1|         2|  150|250.0|
|         2|         1|  110|110.0|
|         2|         2|  200|200.0|
+----------+----------+-----+-----+
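
For completeness, here is a minimal self-contained sketch that reproduces the data above and shows both the window-based per-partition total and an equivalent aggregation-plus-join. The SparkSession setup, the sample DataFrame construction, and the groupBy-plus-join alternative are my own additions, not part of the original post.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Sample data matching the question (column names assumed from the post).
df = spark.createDataFrame(
    [(1, 2, 100), (1, 2, 150), (2, 1, 110), (2, 2, 200)],
    ["Category A", "Category B", "Value"],
)

# Unbounded frame: every row receives the total of its ("Category A", "Category B") group.
windowSpec = (
    Window.partitionBy("Category A", "Category B")
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)
df.withColumn("sum", F.sum("Value").over(windowSpec)).show()

# Equivalent without a window function: aggregate per group, then join the totals back.
totals = df.groupBy("Category A", "Category B").agg(F.sum("Value").alias("sum"))
df.join(totals, on=["Category A", "Category B"]).show()

# Note: adding an orderBy to the window and keeping the default frame would instead
# produce a running (cumulative) sum within each partition.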