Summary statistics in Scala

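In this example, how can I elegantly compute summary statistics (e.g. mean and variance) for each group, where the groups are the distinct metric names?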

case class MeasureUnit(name: String, value: Double)

Seq(MeasureUnit("metric1", 0.04), MeasureUnit("metric1", 0.09),
  MeasureUnit("metric2", 0.64), MeasureUnit("metric2", 0.34), MeasureUnit("metric2", 0.84))

A good example of how to compute the mean/variance per attribute is https://chrisbissell.wordpress.com/2011/05/23/a-simple-but-very-flexible-statistics-library-in-scala/ but it does not cover grouping.

You can use Seq#groupBy:

val measureSeq : Seq[MeasureUnit] = ???

type Name = String

// "metric1" -> Seq(0.04, 0.09), "metric2" -> Seq(0.64, 0.34, 0.84)
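// map is used instead of mapValues: since Scala 2.13, mapValues is deprecated
// and returns a lazy MapView rather than a Map
val groupedMeasures : Map[Name, Seq[Double]] = 
  measureSeq
    .groupBy(_.name)
    .map { case (name, units) => name -> units.map(_.value) }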
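
On Scala 2.13 or later, the grouping and the projection to the raw values can also be done in a single call with groupMap. The following is a minimal sketch, not part of the original answer; the wrapper object GroupMapSketch is a name introduced here for illustration, and the data is the sequence from the question.

```scala
// Requires Scala 2.13 or later (groupMap was added to the collections in 2.13).
object GroupMapSketch {
  case class MeasureUnit(name: String, value: Double)

  // Sample data taken from the question.
  val measureSeq: Seq[MeasureUnit] = Seq(
    MeasureUnit("metric1", 0.04), MeasureUnit("metric1", 0.09),
    MeasureUnit("metric2", 0.64), MeasureUnit("metric2", 0.34), MeasureUnit("metric2", 0.84))

  // Group by name and keep only the value of each element.
  // "metric1" -> Seq(0.04, 0.09), "metric2" -> Seq(0.64, 0.34, 0.84)
  val groupedMeasures: Map[String, Seq[Double]] =
    measureSeq.groupMap(_.name)(_.value)
}
```

The result is a strict Map, so it can be reused for several statistics without re-running the projection.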

These groups can then be used to compute your summary statistics:

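// mean and variance are assumed to be helper functions of type
// Seq[Double] => Double; they are not defined in this answer
type Mean = Double

val meanMapping : Map[Name, Mean] = 
  groupedMeasures.map { case (name, values) => name -> mean(values) }

type Variance = Double

val varianceMapping : Map[Name, Variance] = 
  groupedMeasures.map { case (name, values) => name -> variance(values) }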

Alternatively, you can map each name to a tuple of statistics:

type Summary = (Mean, Variance)

val summaryMapping : Map[Name, Summary] = 
  groupedMeasures.map { case (name, values) => (name, (mean(values), variance(values))) }
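
The snippets in this answer call mean and variance without defining them. Below is a minimal end-to-end sketch that fills them in under stated assumptions: both helpers are hypothetical, they assume non-empty input, and variance is the population variance (dividing by n). The wrapper object SummarySketch is also a name introduced here, and the data is the sequence from the question.

```scala
object SummarySketch {
  case class MeasureUnit(name: String, value: Double)

  // Hypothetical helper: arithmetic mean of a non-empty sequence.
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Hypothetical helper: population variance (divides by n).
  // Divide by (n - 1) instead if the sample variance is wanted.
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.size
  }

  // Sample data taken from the question.
  val measureSeq: Seq[MeasureUnit] = Seq(
    MeasureUnit("metric1", 0.04), MeasureUnit("metric1", 0.09),
    MeasureUnit("metric2", 0.64), MeasureUnit("metric2", 0.34), MeasureUnit("metric2", 0.84))

  // Group by metric name, then reduce each group to a (mean, variance) pair.
  val summaryMapping: Map[String, (Double, Double)] =
    measureSeq
      .groupBy(_.name)
      .map { case (name, units) =>
        val values = units.map(_.value)
        val summary = (mean(values), variance(values))
        name -> summary
      }

  def main(args: Array[String]): Unit =
    summaryMapping.foreach { case (name, (m, v)) =>
      println(s"$name: mean=$m variance=$v")
    }
}
```

For metric1 this gives a mean of about 0.065 and a population variance of about 0.000625. If the groups are large, a single-pass or numerically stable algorithm may be preferable to this two-pass version.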