Pyspark - 根据时间和位置计算组中的平均速度

Pyspark - calculate average velocity in group based on time and location

我有以下 PYSPARK 数据框:

+-------------------+----+---------+------+
|      timestamplast|ship|  X_pos  |time_d|
+-------------------+----+---------+------+
|2019-08-01 00:00:00|   1|     3   |   0  |
|2019-08-01 00:00:09|   1|     4   |   9  |
|2019-08-01 00:00:20|   1|     5   |  11  |
|2019-08-01 00:00:27|   1|     9   |   7  |
|2019-08-01 00:00:38|   2|     3   |   0  |
|2019-08-01 00:00:39|   2|     8   |   1  |
|2019-08-01 00:00:57|   2|     20  |  18  |
+-------------------+----+---------+------+

其中timestamplast是日期时间,time_d是组内的时间差"ship"(time_d在新的"ship"开始时为零。我想计算组内的平均速度 "ship" 并根据时间差和位置将结果附加到数据帧 X_pos

ship==1 的平均速度为:(1/9 + 1/11 + 4/7)/3 = 0.26 m/s。 ship==2 的平均速度为:(5/1 + 12/18 /2 = 2.83 m/s.

编辑: ship==1 的平均速度为:((4-3)/(9) + (5-4)/(11) + (9-5)/(7))/3 = 0.26 m/s. ship==2 的平均速度为:((8-3)/1 + ((20-8)/18)) /2 = 2.83 m/s.

结果应如下所示:

+-------------------+----+---------+------+-----------+
|      timestamplast|name|     X   |time_d| avg_vel_x |
+-------------------+----+---------+------+-----------|
|2019-08-01 00:00:00|   1|     3   |   0  |     0.26  |
|2019-08-01 00:00:09|   1|     4   |   9  |     0.26  |
|2019-08-01 00:00:20|   1|     5   |  11  |     0.26  |
|2019-08-01 00:00:27|   1|     9   |   7  |     0.26  |
|2019-08-01 00:00:38|   2|     3   |   0  |     2.83  |
|2019-08-01 00:00:39|   2|     8   |   1  |     2.83  |
|2019-08-01 00:00:57|   2|     20  |  18  |     2.83  |
+-------------------+----+---------+------+-----------|

transform in pandas 可以通过 pyspark 中的 windows 函数复制,类似于 sql , ship == 1 的预期输出应该是0.26 代替。你可以试试:

import pyspark.sql.functions as F

w = Window.partitionBy('ship')
pct_change=((F.col("X_pos")-F.lag("X_pos").over(w.orderBy("timestamplast")))
                                                /F.col("time_d"))
df.withColumn("avg_vel_x",F.round(F.sum(pct_change).over(w)
                                 /(F.count("ship").over(w)-1),2)).show()

+-------------------+----+-----+------+---------+
|      timestamplast|ship|X_pos|time_d|avg_vel_x|
+-------------------+----+-----+------+---------+
|2019-08-01 00:00:00|   1|    3|     0|     0.26|
|2019-08-01 00:00:09|   1|    4|     9|     0.26|
|2019-08-01 00:00:20|   1|    5|    11|     0.26|
|2019-08-01 00:00:27|   1|    9|     7|     0.26|
|2019-08-01 00:00:38|   2|    3|     0|     2.83|
|2019-08-01 00:00:39|   2|    8|     1|     2.83|
|2019-08-01 00:00:57|   2|   20|    18|     2.83|
+-------------------+----+-----+------+---------+