使用 scala / spark 计算数据框列中每一行的 z 分数

Question

我有一个数据框：

val DF = {spark.read.option("header", value = true).option("delimiter", ";").csv(path_file)}

val cord = DF.select("time","longitude", "latitude","speed")

我想计算每行速度的 z 分数（x 均值）/std column.I 计算均值和标准差：

val std = DF.select(col("speed").cast("double")).as[Double].rdd.stdev()
val mean = DF.select(col("speed").cast("double")).as[Double].rdd.mean()

如何计算每行列速度的z分数并得到这个结果：

+----------------+----------------+-   
|A               |B               |speed    |   z score      
+----------------+----------------+---------------------+
|17/02/2020 00:06|      -7.1732833|   50    |    z score     

|17/02/2020 00:16|      -7.1732833|   40    |    z score   

|17/02/2020 00:26|      -7.1732833|   30    |    z score

如何为每一行计算它。

Answer 1

执行此操作的最佳方法是：

df.withColumn("z score", col("speed") - mean / std)

其中均值和标准差的计算方法如问题中所示。

如果有帮助请告诉我！！

Answer 2

您可以避免使用来自 Hive 聚合函数的 window-函数和 STDDEV_POP 的两个单独的 RDD 操作：

val DF = {spark.read.option("header", value = true).option("delimiter", ";").csv(path_file)}

val cord = DF.select($"time",$"longitude", $"latitude",$"speed".cast("double"))

val result = cord
  .withColumn("mean",avg($"speed").over())
  .withColumn("stddev",callUDF("stddev_pop",$"speed").over())
  .withColumn("z-score",($"speed"-$"mean")/$"stddev")

使用 scala / spark 计算数据框列中每一行的 z 分数

calculate the z score for each row in the column of a dataframe using scala / spark

scala

apache-spark

hypothesis-test