pyspark: rolling average using timeseries data

Question

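I have a dataset consisting of a timestamp column and a dollars column. I would like to find the average number of dollars per week, with each week ending at the timestamp of the row in question. I initially looked at the pyspark.sql.functions.window function, but that bins the data by week.

Here's an example: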

%pyspark
import datetime
from pyspark.sql import functions as F

df1 = sc.parallelize([(17,"2017-03-11T15:27:18+00:00"), (13,"2017-03-11T12:27:18+00:00"), (21,"2017-03-17T11:27:18+00:00")]).toDF(["dollars", "datestring"])
df2 = df1.withColumn('timestampGMT', df1.datestring.cast('timestamp'))

w = df2.groupBy(F.window("timestampGMT", "7 days")).agg(F.avg("dollars").alias('avg'))
w.select(w.window.start.cast("string").alias("start"), w.window.end.cast("string").alias("end"), "avg").collect()

This results in two records:

|        start        |          end         | avg |
|---------------------|----------------------|-----|
|'2017-03-16 00:00:00'| '2017-03-23 00:00:00'| 21.0|
|'2017-03-09 00:00:00'| '2017-03-16 00:00:00'| 15.0|

The window function bins the time series data into fixed intervals rather than computing a rolling average.
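To see why only two records come back, here is a minimal plain-Python sketch (no Spark required) of fixed 7-day binning. It assumes the bins are aligned to the Unix epoch, which is consistent with the 2017-03-09 and 2017-03-16 window starts shown above; the helper name `bucket_start` is made up for illustration and is not part of PySpark.

```python
from collections import defaultdict
from datetime import datetime, timezone

WEEK = 7 * 86400  # seconds in 7 days

def bucket_start(ts: datetime) -> datetime:
    """Start of the fixed 7-day bin containing ts (hypothetical helper, epoch-aligned)."""
    epoch_seconds = int(ts.timestamp())
    return datetime.fromtimestamp(epoch_seconds - epoch_seconds % WEEK, tz=timezone.utc)

rows = [(17, "2017-03-11T15:27:18+00:00"),
        (13, "2017-03-11T12:27:18+00:00"),
        (21, "2017-03-17T11:27:18+00:00")]

# Every row lands in exactly one bin, so rows in the same bin share one average.
bins = defaultdict(list)
for dollars, datestring in rows:
    bins[bucket_start(datetime.fromisoformat(datestring))].append(dollars)

binned_avg = {start.strftime("%Y-%m-%d"): sum(v) / len(v) for start, v in bins.items()}
print(binned_avg)
```

Because each timestamp belongs to exactly one bin, the bins never overlap, which is what distinguishes this from a rolling window.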

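Is there a way to compute a rolling average, so that each row gets a weekly average over the period ending at that row's timestampGMT?

EDIT:

Zhang's answer below is close to what I want, but not exactly what I'd like to see.

Here's a better example to show what I'm trying to get at: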

%pyspark
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00"),
                        (13, "2017-03-15T12:27:18+00:00"),
                        (25, "2017-03-18T11:27:18+00:00")],
                        ["dollars", "timestampGMT"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
df = df.withColumn('rolling_average', F.avg("dollars").over(Window.partitionBy(F.window("timestampGMT", "7 days"))))

This results in the following dataframe:

dollars timestampGMT            rolling_average
25      2017-03-18 11:27:18.0   25
17      2017-03-10 15:27:18.0   15
13      2017-03-15 12:27:18.0   15

I'd like the average to be taken over the week preceding the date in the timestampGMT column, which would result in this:

dollars timestampGMT            rolling_average
17      2017-03-10 15:27:18.0   17
13      2017-03-15 12:27:18.0   15
25      2017-03-18 11:27:18.0   19

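In the results above, the rolling_average for 2017-03-10 is 17, since there are no preceding records. The rolling_average for 2017-03-15 is 15 because it is the average of the 13 from 2017-03-15 and the 17 from 2017-03-10, which falls within the preceding 7-day window. The rolling_average for 2017-03-18 is 19 because it is the average of the 25 from 2017-03-18 and the 13 from 2017-03-15, which falls within the preceding 7-day window; it does not include the 17 from 2017-03-10, because that falls outside the preceding 7-day window.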
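The arithmetic above can be checked with a short plain-Python sketch. It is not Spark code; `trailing_average` is a hypothetical helper that simply averages every row whose timestamp lies in the 7 days up to and including the current row.

```python
from datetime import datetime, timedelta

rows = [(17, "2017-03-10T15:27:18+00:00"),
        (13, "2017-03-15T12:27:18+00:00"),
        (25, "2017-03-18T11:27:18+00:00")]
parsed = [(dollars, datetime.fromisoformat(ts)) for dollars, ts in rows]

def trailing_average(end, data, days=7):
    """Average of dollars whose timestamp lies in [end - days, end] (hypothetical helper)."""
    start = end - timedelta(days=days)
    in_window = [d for d, ts in data if start <= ts <= end]
    return sum(in_window) / len(in_window)

# One average per row, each over the window that ends at that row's timestamp.
expected = [trailing_average(ts, parsed) for _, ts in parsed]
print(expected)  # [17.0, 15.0, 19.0]
```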

Is there a way to accomplish this, rather than using the binning window, where the weekly windows don't overlap?

Answer 1

Do you mean this:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

df = spark.createDataFrame([(17, "2017-03-11T15:27:18+00:00"),
                            (13, "2017-03-11T12:27:18+00:00"),
                            (21, "2017-03-17T11:27:18+00:00")],
                           ["dollars", "timestampGMT"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
# partition the rows by their fixed 7-day bin and average within each bin
df = df.withColumn('rolling_average', f.avg("dollars").over(Window.partitionBy(f.window("timestampGMT", "7 days"))))

Output:

+-------+-------------------+---------------+                                   
|dollars|timestampGMT       |rolling_average|
+-------+-------------------+---------------+
|21     |2017-03-17 19:27:18|21.0           |
|17     |2017-03-11 23:27:18|15.0           |
|13     |2017-03-11 20:27:18|15.0           |
+-------+-------------------+---------------+

Answer 2

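I figured out the correct way to calculate a moving/rolling average with the help of another Stack Overflow post.

The basic idea is to convert your timestamp column to seconds; you can then use the rangeBetween function of the pyspark.sql.Window class to include the correct rows in your window.

Here's the solved example: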

%pyspark
from pyspark.sql import functions as F
from pyspark.sql.window import Window


#function to calculate number of seconds from number of days
days = lambda i: i * 86400

df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00"),
                        (13, "2017-03-15T12:27:18+00:00"),
                        (25, "2017-03-18T11:27:18+00:00")],
                        ["dollars", "timestampGMT"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))

#create window by casting timestamp to long (number of seconds)
w = (Window.orderBy(F.col("timestampGMT").cast('long')).rangeBetween(-days(7), 0))

df = df.withColumn('rolling_average', F.avg("dollars").over(w))

This produces exactly the column of rolling averages I was looking for:

dollars   timestampGMT            rolling_average
17        2017-03-10 15:27:18.0   17.0
13        2017-03-15 12:27:18.0   15.0
25        2017-03-18 11:27:18.0   19.0

Answer 3

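It's worth noting that if you don't care about the exact dates, but only want the average over the most recent rows (for example, the last few days when there is exactly one row per day), you can use the rowsBetween function as follows: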

w = Window.orderBy('timestampGMT').rowsBetween(-7, 0)

df = df.withColumn('rolling_average', F.avg('dollars').over(w))

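Since you order by the dates, the window takes the current row plus the 7 preceding rows (up to 8 rows in total; use rowsBetween(-6, 0) if you want exactly 7). You save all the casting.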
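To make the difference between a row-based and a time-based frame concrete, here is a plain-Python sketch that emulates a row-count frame on the three-row example from the question. `rows_between_avg` is a hypothetical helper written for illustration, not a PySpark function.

```python
values = [17, 13, 25]  # dollars, already ordered by timestampGMT

def rows_between_avg(values, preceding, following=0):
    """Emulate avg over a frame of row offsets [-preceding, +following] (hypothetical helper)."""
    out = []
    for i in range(len(values)):
        # The frame is chosen purely by position; timestamps play no role.
        frame = values[max(0, i - preceding): i + following + 1]
        out.append(sum(frame) / len(frame))
    return out

row_based = rows_between_avg(values, 7)
print(row_based)
```

For the last row the row-based frame still contains the 17 from 2017-03-10, so its average is (17 + 13 + 25) / 3 rather than the 19.0 that a 7-day time range gives. The two approaches agree only when the rows are evenly spaced in time.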

Answer 4

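I'll add a variation that I personally found very useful. I hope someone else finds it useful too:

If you want to group the data and compute the moving average within each group:

Example dataframe: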

from pyspark.sql.window import Window
from pyspark.sql import functions as func


df = spark.createDataFrame([("tshilidzi", 17.00, "2018-03-10T15:27:18+00:00"), 
  ("tshilidzi", 13.00, "2018-03-11T12:27:18+00:00"),   
  ("tshilidzi", 25.00, "2018-03-12T11:27:18+00:00"), 
  ("thabo", 20.00, "2018-03-13T15:27:18+00:00"), 
  ("thabo", 56.00, "2018-03-14T12:27:18+00:00"), 
  ("thabo", 99.00, "2018-03-15T11:27:18+00:00"), 
  ("tshilidzi", 156.00, "2019-03-22T11:27:18+00:00"), 
  ("thabo", 122.00, "2018-03-31T11:27:18+00:00"), 
  ("tshilidzi", 7000.00, "2019-04-15T11:27:18+00:00"),
  ("ash", 9999.00, "2018-04-16T11:27:18+00:00") 
  ],
  ["name", "dollars", "timestampGMT"])

# cast the ISO-8601 string to a timestamp; it is cast to seconds later for the Window time frame
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))

df.show(10000, False)

Output:

+---------+-------+---------------------+
|name     |dollars|timestampGMT         |
+---------+-------+---------------------+
|tshilidzi|17.0   |2018-03-10 17:27:18.0|
|tshilidzi|13.0   |2018-03-11 14:27:18.0|
|tshilidzi|25.0   |2018-03-12 13:27:18.0|
|thabo    |20.0   |2018-03-13 17:27:18.0|
|thabo    |56.0   |2018-03-14 14:27:18.0|
|thabo    |99.0   |2018-03-15 13:27:18.0|
|tshilidzi|156.0  |2019-03-22 13:27:18.0|
|thabo    |122.0  |2018-03-31 13:27:18.0|
|tshilidzi|7000.0 |2019-04-15 13:27:18.0|
|ash      |9999.0 |2018-04-16 13:27:18.0|
+---------+-------+---------------------+

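To compute the moving average per name while still keeping all rows:

# number of seconds in i days
days = lambda i: i * 86400

# create window by casting timestamp to long (number of seconds),
# partitioned by name so that each name gets its own rolling average
w = (Window
     .partitionBy(func.col("name"))
     .orderBy(func.col("timestampGMT").cast('long'))
     .rangeBetween(-days(7), 0))

df2 = df.withColumn('rolling_average', func.avg("dollars").over(w))

df2.show(100, False)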

Output:

+---------+-------+---------------------+------------------+
|name     |dollars|timestampGMT         |rolling_average   |
+---------+-------+---------------------+------------------+
|ash      |9999.0 |2018-04-16 13:27:18.0|9999.0            |
|tshilidzi|17.0   |2018-03-10 17:27:18.0|17.0              |
|tshilidzi|13.0   |2018-03-11 14:27:18.0|15.0              |
|tshilidzi|25.0   |2018-03-12 13:27:18.0|18.333333333333332|
|tshilidzi|156.0  |2019-03-22 13:27:18.0|156.0             |
|tshilidzi|7000.0 |2019-04-15 13:27:18.0|7000.0            |
|thabo    |20.0   |2018-03-13 17:27:18.0|20.0              |
|thabo    |56.0   |2018-03-14 14:27:18.0|38.0              |
|thabo    |99.0   |2018-03-15 13:27:18.0|58.333333333333336|
|thabo    |122.0  |2018-03-31 13:27:18.0|122.0             |
+---------+-------+---------------------+------------------+
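
As a cross-check of the table above that does not need a Spark session, the following plain-Python sketch groups the same rows by name and averages each group's rows over the trailing 7 days. It only mirrors the partition-then-range logic; the dictionary layout and key format are choices made for this illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

rows = [("tshilidzi", 17.0, "2018-03-10T15:27:18+00:00"),
        ("tshilidzi", 13.0, "2018-03-11T12:27:18+00:00"),
        ("tshilidzi", 25.0, "2018-03-12T11:27:18+00:00"),
        ("thabo", 20.0, "2018-03-13T15:27:18+00:00"),
        ("thabo", 56.0, "2018-03-14T12:27:18+00:00"),
        ("thabo", 99.0, "2018-03-15T11:27:18+00:00"),
        ("tshilidzi", 156.0, "2019-03-22T11:27:18+00:00"),
        ("thabo", 122.0, "2018-03-31T11:27:18+00:00"),
        ("tshilidzi", 7000.0, "2019-04-15T11:27:18+00:00"),
        ("ash", 9999.0, "2018-04-16T11:27:18+00:00")]

# "Partition": collect each name's rows separately.
groups = defaultdict(list)
for name, dollars, ts in rows:
    groups[name].append((datetime.fromisoformat(ts), dollars))

# "Range frame": within a group, average rows from the 7 days ending at each row.
rolling = {}
for name, group in groups.items():
    for end, _ in group:
        start = end - timedelta(days=7)
        frame = [d for ts, d in group if start <= ts <= end]
        rolling[(name, end.strftime("%Y-%m-%d"))] = sum(frame) / len(frame)

for key in sorted(rolling):
    print(key, rolling[key])
```

Rows from other names never enter a frame, which is why thabo's 20.0 on 2018-03-13 is unaffected by tshilidzi's rows from the days just before it.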
