PySpark rolling sum in a time series while keeping the sequential dates in the rows
In PySpark, I want to add columns with an X-day rolling sum over the time-series dates, while keeping every date present in the rows even when some of their values are zero.
Original dataframe:
ColA ColB ColC Date(d-m-y) Value
A    A1   AA1  1-1-2021    1
A    A1   AA1  2-1-2021    2
A    A1   AA1  3-1-2021    3
A    A1   AA1  4-1-2021    4
A    A1   AA1  5-1-2021    0
A    A1   AA1  6-1-2021    0
B    B1   AB1  1-1-2021    5
B    B1   AB1  2-1-2021    6
B    B1   AB1  3-1-2021    7
B    B1   AB1  4-1-2021    8
B    B1   AB1  5-1-2021    9
B    B1   AB1  6-1-2021    10
Expected dataframe:
ColA ColB ColC Date(d-m-y) Value Rolling2day Rolling4day
A    A1   AA1  1-1-2021    1     1           1  (1)
A    A1   AA1  2-1-2021    2     3           3  (1+2)
A    A1   AA1  3-1-2021    3     5           6  (1+2+3)
A    A1   AA1  4-1-2021    4     7           10 (1+2+3+4)
A    A1   AA1  5-1-2021    0     4           9  (2+3+4+0)
A    A1   AA1  6-1-2021    0     0           7  (3+4+0+0)
B    B1   AB1  1-1-2021    5     5           5  (5)
B    B1   AB1  2-1-2021    6     11          11 (5+6)
B    B1   AB1  3-1-2021    7     13          18 (5+6+7)
B    B1   AB1  4-1-2021    8     15          26 (5+6+7+8)
B    B1   AB1  5-1-2021    9     17          30 (6+7+8+9)
B    B1   AB1  6-1-2021    10    19          34 (7+8+9+10)
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Load the sample data with pandas, then convert it to a Spark DataFrame
df1 = pd.read_csv("./your-localpath/tmp.csv")
df = spark.createDataFrame(df1)
# Rolling 2-day sum: the current row plus the 1 preceding row within each group
df.selectExpr("*", "sum(Value) over(partition by ColA,ColB,ColC order by `Date(d-m-y)` rows between 1 preceding and current row) as Rolling2day").show()
+----+----+----+-----------+-----+-----------+
|ColA|ColB|ColC|Date(d-m-y)|Value|Rolling2day|
+----+----+----+-----------+-----+-----------+
| A| A1| AA1| 1-1-2021| 1| 1|
| A| A1| AA1| 2-1-2021| 2| 3|
| A| A1| AA1| 3-1-2021| 3| 5|
| A| A1| AA1| 4-1-2021| 4| 7|
| A| A1| AA1| 5-1-2021| 0| 4|
| A| A1| AA1| 6-1-2021| 0| 0|
| B| B1| AB1| 1-1-2021| 5| 5|
| B| B1| AB1| 2-1-2021| 6| 11|
| B| B1| AB1| 3-1-2021| 7| 13|
| B| B1| AB1| 4-1-2021| 8| 15|
| B| B1| AB1| 5-1-2021| 9| 17|
| B| B1| AB1| 6-1-2021| 10| 19|
+----+----+----+-----------+-----+-----------+
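One caveat worth flagging: the window above orders by `Date(d-m-y)` as a plain string, which happens to sort correctly for this single-digit-day sample but would misorder a date such as 10-1-2021 (lexicographically it sorts before 2-1-2021). A minimal sketch of the safer route, parsing the string into a real date first; the d-M-yyyy pattern is an assumption based on the sample format:

from pyspark.sql import functions as F

# Parse the d-m-y string into a proper date so the window sorts chronologically
df = df.withColumn("d", F.to_date(df["Date(d-m-y)"], "d-M-yyyy"))
df.selectExpr("*", "sum(Value) over(partition by ColA,ColB,ColC order by d rows between 1 preceding and current row) as Rolling2day").show()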
If you want the rolling sum over four days, change the 1 to 3 (i.e. rows between 3 preceding and current row) and you get what you want.
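For completeness, a minimal sketch that adds both rolling columns in one pass with the DataFrame Window API, equivalent to the SQL above; it assumes the parsed date column d from the previous snippet:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One base window per group; rowsBetween(-1, 0) is the 2-day frame,
# rowsBetween(-3, 0) the 4-day frame
base = Window.partitionBy("ColA", "ColB", "ColC").orderBy("d")
df = (df
      .withColumn("Rolling2day", F.sum("Value").over(base.rowsBetween(-1, 0)))
      .withColumn("Rolling4day", F.sum("Value").over(base.rowsBetween(-3, 0))))
df.show()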