Pyspark - calculate average velocity in group based on time and location
I have the following PySpark dataframe:
+-------------------+----+-----+------+
|      timestamplast|ship|X_pos|time_d|
+-------------------+----+-----+------+
|2019-08-01 00:00:00|   1|    3|     0|
|2019-08-01 00:00:09|   1|    4|     9|
|2019-08-01 00:00:20|   1|    5|    11|
|2019-08-01 00:00:27|   1|    9|     7|
|2019-08-01 00:00:38|   2|    3|     0|
|2019-08-01 00:00:39|   2|    8|     1|
|2019-08-01 00:00:57|   2|   20|    18|
+-------------------+----+-----+------+
where timestamplast is a datetime and time_d is the time difference (in seconds) within a "ship" group (time_d is zero when a new "ship" starts). I want to calculate the average velocity within each "ship" group, based on the time difference and the position X_pos, and append the result to the dataframe.
The average velocity for ship==1 would be: (1/9 + 1/11 + 4/7)/3 = 0.26 m/s.
The average velocity for ship==2 would be: (5/1 + 12/18)/2 = 2.83 m/s.
Edit:
The average velocity for ship==1 would be: ((4-3)/9 + (5-4)/11 + (9-5)/7)/3 = 0.26 m/s.
The average velocity for ship==2 would be: ((8-3)/1 + (20-8)/18)/2 = 2.83 m/s.
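For reference, a minimal sketch of how this sample dataframe could be constructed (not part of the original question; it assumes an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data copied from the table above
data = [
    ("2019-08-01 00:00:00", 1, 3, 0),
    ("2019-08-01 00:00:09", 1, 4, 9),
    ("2019-08-01 00:00:20", 1, 5, 11),
    ("2019-08-01 00:00:27", 1, 9, 7),
    ("2019-08-01 00:00:38", 2, 3, 0),
    ("2019-08-01 00:00:39", 2, 8, 1),
    ("2019-08-01 00:00:57", 2, 20, 18),
]
df = spark.createDataFrame(data, ["timestamplast", "ship", "X_pos", "time_d"])
# timestamplast is created as a string here; cast it so ordering by time is unambiguous
df = df.withColumn("timestamplast", df["timestamplast"].cast("timestamp"))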
The result should look like this:
+-------------------+----+-----+------+---------+
|      timestamplast|ship|X_pos|time_d|avg_vel_x|
+-------------------+----+-----+------+---------+
|2019-08-01 00:00:00|   1|    3|     0|     0.26|
|2019-08-01 00:00:09|   1|    4|     9|     0.26|
|2019-08-01 00:00:20|   1|    5|    11|     0.26|
|2019-08-01 00:00:27|   1|    9|     7|     0.26|
|2019-08-01 00:00:38|   2|    3|     0|     2.83|
|2019-08-01 00:00:39|   2|    8|     1|     2.83|
|2019-08-01 00:00:57|   2|   20|    18|     2.83|
+-------------------+----+-----+------+---------+
transform in pandas can be replicated with window functions in PySpark, similar to SQL. Also, the expected output for ship == 1 should be 0.26 instead. You can try:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('ship')
# Change in X_pos since the previous row of the same ship, divided by the time difference
pct_change = ((F.col("X_pos") - F.lag("X_pos").over(w.orderBy("timestamplast")))
              / F.col("time_d"))
# Sum the per-row velocities over the whole group and divide by (row count - 1),
# since the first row of each ship has no previous position
df.withColumn("avg_vel_x", F.round(F.sum(pct_change).over(w)
                                   / (F.count("ship").over(w) - 1), 2)).show()
+-------------------+----+-----+------+---------+
| timestamplast|ship|X_pos|time_d|avg_vel_x|
+-------------------+----+-----+------+---------+
|2019-08-01 00:00:00| 1| 3| 0| 0.26|
|2019-08-01 00:00:09| 1| 4| 9| 0.26|
|2019-08-01 00:00:20| 1| 5| 11| 0.26|
|2019-08-01 00:00:27| 1| 9| 7| 0.26|
|2019-08-01 00:00:38| 2| 3| 0| 2.83|
|2019-08-01 00:00:39| 2| 8| 1| 2.83|
|2019-08-01 00:00:57| 2| 20| 18| 2.83|
+-------------------+----+-----+------+---------+
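For comparison, the same column could also be obtained with a groupBy aggregation joined back onto the dataframe (a sketch under the same assumptions, not part of the original answer; avg() skips the null produced by lag on each ship's first row, so it divides by row count - 1 just like the code above):

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("ship").orderBy("timestamplast")

# Velocity between consecutive positions of the same ship; null on each ship's first row
step_vel = (F.col("X_pos") - F.lag("X_pos").over(w)) / F.col("time_d")

avg_per_ship = (df.withColumn("step_vel", step_vel)
                  .groupBy("ship")
                  .agg(F.round(F.avg("step_vel"), 2).alias("avg_vel_x")))

df.join(avg_per_ship, on="ship", how="left").show()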