Calculate the average difference between dates using PySpark

Question

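I have a DataFrame that looks like this: a user ID and the date of an activity. I need to calculate the average difference between dates using RDD functions (such as reduce and map) rather than SQL.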

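The dates for each ID need to be sorted before the differences are calculated, because I need the difference between each pair of consecutive dates.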

ID	Date
1	2020-09-03
1	2020-09-03
2	2020-09-02
1	2020-09-04
2	2020-09-06
2	2020-09-16

The desired result for this example would be:

ID	average difference
1	0.5
2	7

Thanks for the help!

Answer 1

You can use datediff together with a window function to compute the differences, then take the average.

lag is one of the window functions; it takes the value from the previous row within the window.
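To make the behaviour of lag concrete, here is a plain-Python sketch of what happens inside one ordered partition. It is not Spark code: `lag_values` is a hypothetical helper written only for illustration, and the dates are the ones for ID 2 from the question.

```python
from datetime import date


def lag_values(values, offset=1):
    """Mimic lag over an ordered partition: each row sees the value
    `offset` rows earlier, or None when no such row exists."""
    n = len(values)
    return [None] * min(offset, n) + list(values[:max(n - offset, 0)])


# Dates for ID 2, sorted the way orderBy('Date') would sort them.
dates = sorted([date(2020, 9, 2), date(2020, 9, 16), date(2020, 9, 6)])
previous = lag_values(dates)

# Difference in days between each date and the one before it.
diffs = [(d - p).days if p is not None else None for d, p in zip(dates, previous)]
print(diffs)  # [None, 4, 10]
```

The first row of each partition has no previous row, so its difference is null. Spark's avg ignores nulls, which is why the average for ID 2 is (4 + 10) / 2 = 7 rather than 14 / 3.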

from pyspark.sql import Window
from pyspark.sql import functions as F

# Define the window: one partition per ID, rows ordered by Date.
w = Window.partitionBy('ID').orderBy('Date')

# datediff(end, start) returns end - start in days, so each row gets its
# Date minus the previous Date for the same ID (null for the first row).
(df.withColumn('diff', F.datediff(F.col('Date'), F.lag('Date').over(w)))
  .groupby('ID')    # aggregate over ID
  .agg(F.avg(F.col('diff')).alias('average difference'))  # avg skips nulls
)
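Since the question asks for RDD functions, the same calculation could also be sketched with map, groupByKey and mapValues. This is a minimal sketch rather than a tested production solution: it assumes `df` has columns named ID and Date, that Date holds either date objects or 'YYYY-MM-DD' strings, and the helper names (`parse_date`, `average_gap`, `average_gap_per_id`) are made up for this example.

```python
from datetime import date, datetime


def parse_date(value):
    """Accept a datetime, a date, or an ISO 'YYYY-MM-DD' string."""
    if isinstance(value, datetime):
        return value.date()
    if isinstance(value, date):
        return value
    return datetime.strptime(value, "%Y-%m-%d").date()


def average_gap(dates):
    """Sort the dates and average the gaps (in days) between neighbours.
    Returns None when there are fewer than two dates."""
    ordered = sorted(parse_date(d) for d in dates)
    gaps = [(b - a).days for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps) if gaps else None


def average_gap_per_id(df):
    """Return an RDD of (ID, average gap) pairs.
    Assumes df is a Spark DataFrame with columns 'ID' and 'Date'."""
    return (df.rdd
            .map(lambda row: (row["ID"], row["Date"]))  # (ID, Date) pairs
            .groupByKey()                               # all dates per ID
            .mapValues(average_gap))


# The per-ID logic can be checked without Spark, using the question's data:
print(average_gap(["2020-09-03", "2020-09-03", "2020-09-04"]))  # 0.5
print(average_gap(["2020-09-02", "2020-09-06", "2020-09-16"]))  # 7.0
```

With a real DataFrame, `average_gap_per_id(df).collect()` would return the pairs. groupByKey brings all dates for one ID together on a single executor, which is fine for modest per-user histories. If that becomes a problem, note that the consecutive gaps of a sorted sequence sum to (max - min), so the average equals (max - min) / (count - 1); that form only needs the minimum, maximum and count per ID, which can be combined with reduceByKey.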
