Weighted average with Spark Datasets without UDF
Computing a weighted average has been asked about before, but in this question I'm asking about Datasets/DataFrames rather than RDDs.
How do you compute a weighted average in Spark? I have two columns: a count and a previous average:
case class Stat(name: String, count: Int, average: Double)
val statset = spark.createDataset(Seq(Stat("NY", 1, 5.0),
                                      Stat("NY", 2, 1.5),
                                      Stat("LA", 12, 1.0),
                                      Stat("LA", 15, 3.0)))
I'd like to be able to compute the weighted average like this:
display(statset.groupBy($"name").agg(sum($"count").as("count"),
                                     weightedAverage($"count", $"average").as("average")))
You can get close with a UDF:
val weightedAverage = udf((row: Row) => {
  // The struct passed in holds the collected counts and averages for one group.
  val counts   = row.getAs[WrappedArray[Int]](0)
  val averages = row.getAs[WrappedArray[Double]](1)
  // Accumulate the total count and the count-weighted total in one fold.
  val (count, total) = (counts zip averages).foldLeft((0, 0.0)) {
    case ((cumcount, cumtotal), (newcount, newaverage)) =>
      (cumcount + newcount, cumtotal + newcount * newaverage)
  }
  total / count // Tested by returning count here and then extracting. Got same result as sum.
})
display(statset.groupBy($"name").agg(sum($"count").as("count"),
                                     weightedAverage(struct(collect_list($"count"),
                                                            collect_list($"average"))).as("average")))
(Thanks to the answer that helped me write this.)
For newcomers, these are the imports to use:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.collection.mutable.WrappedArray
Is there a way to do this with built-in column functions instead of a UDF? The UDF feels clunky, and if the numbers get large you have to switch the Ints over to Longs.
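The Int-to-Long switch mentioned above would look roughly like this inside the UDF's fold, assuming the accumulator is seeded with a Long (a minimal sketch with stand-in values):
// Sketch of widening the running count to Long inside the fold (stand-in values):
val counts   = Seq(1, 2)      // placeholder for the collected count column
val averages = Seq(5.0, 1.5)  // placeholder for the collected average column
val (count, total) = (counts zip averages).foldLeft((0L, 0.0)) {
  case ((cumcount, cumtotal), (newcount, newaverage)) =>
    (cumcount + newcount, cumtotal + newcount * newaverage)  // Int promotes to Long/Double
}
// total / count == 8.0 / 3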
It looks like you can do it in two passes:
val totalCount = statset.select(sum($"count")).collect.head.getLong(0)
statset.select(lit(totalCount) as "count", sum($"average" * $"count" / lit(totalCount)) as "average").show
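On the sample data this gives totalCount = 30 and a global weighted average of (1*5.0 + 2*1.5 + 12*1.0 + 15*3.0) / 30 = 65.0 / 30 ≈ 2.17 (a hand check, not actual Spark output).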
Or, including the groupBy you just added:
display(statset.groupBy($"name").agg(sum($"count").as("count"),
                                     sum($"count" * $"average").as("total"))
  .select($"name", $"count", ($"total" / $"count").as("average")))