Pyspark: not cumulative sum over partition
I want to sum over a partition, where the result is not a cumulative sum but the total for each partition:
From:
Category A | Category B | Value |
---|---|---|
1 | 2 | 100 |
1 | 2 | 150 |
2 | 1 | 110 |
2 | 2 | 200 |
I want:
Category A | Category B | Value | Sum |
---|---|---|---|
1 | 2 | 100 | 250 |
1 | 2 | 150 | 250 |
2 | 1 | 110 | 110 |
2 | 2 | 200 | 200 |
With:
from pyspark.sql.functions import sum
from pyspark.sql.window import Window
windowSpec = Window.partitionBy(["Category A","Category B"])
df = df.withColumn('sum', sum(df.Value).over(windowSpec))
I am not getting the result I want; I am getting a cumulative sum instead:
Category A | Category B | Value | Sum |
---|---|---|---|
1 | 2 | 100 | 100 |
1 | 2 | 150 | 250 |
2 | 1 | 110 | 110 |
2 | 2 | 200 | 200 |
How should I proceed? Thanks
When you define the window, you can specify a range for it. Setting the range to (Window.unboundedPreceding, Window.unboundedFollowing) sums over all rows within each partition, regardless of row order:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Unbounded frame: the aggregate covers the entire partition.
windowSpec = Window.partitionBy(["Category A","Category B"])\
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('sum', F.sum(df.Value).over(windowSpec))\
    .orderBy("Category A", "Category B").show()
which prints:
+----------+----------+-----+-----+
|Category A|Category B|Value| sum|
+----------+----------+-----+-----+
| 1| 2| 100|250.0|
| 1| 2| 150|250.0|
| 2| 1| 110|110.0|
| 2| 2| 200|200.0|
+----------+----------+-----+-----+
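A note beyond the original answer: the cumulative result shown in the question is what Spark produces when the window spec also contains an orderBy, because the default frame then runs from the start of the partition to the current row. Below is a minimal, self-contained sketch contrasting the two frames; it assumes a local SparkSession, and the variable names (cumulative, total) are illustrative.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Sample data from the question.
df = spark.createDataFrame(
    [(1, 2, 100), (1, 2, 150), (2, 1, 110), (2, 2, 200)],
    ["Category A", "Category B", "Value"],
)

# With an orderBy, Spark's default frame is (unboundedPreceding, currentRow),
# which produces a running (cumulative) sum within each partition.
cumulative = Window.partitionBy("Category A", "Category B").orderBy("Value")

# With an explicit unbounded frame, sum() is the per-partition total.
total = Window.partitionBy("Category A", "Category B")\
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn("cumulative_sum", F.sum("Value").over(cumulative))\
  .withColumn("partition_total", F.sum("Value").over(total))\
  .orderBy("Category A", "Category B", "Value").show()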
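If a window function is not required, an alternative (not from the original answer) is to aggregate the per-group totals and join them back onto the source DataFrame; a sketch under the same assumptions as above:

from pyspark.sql import functions as F

# Per-group totals, joined back so every row carries its partition total.
totals = df.groupBy("Category A", "Category B")\
    .agg(F.sum("Value").alias("sum"))

df.join(totals, on=["Category A", "Category B"], how="left")\
  .orderBy("Category A", "Category B").show()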