Spark Dataset - Aggregate query summing BigInt shows the sum as zero

I have a Dataset of type ExpenseEntry. ExpenseEntry is a basic data structure that tracks the amount spent per category:
case class ExpenseEntry(
    name: String,
    category: String,
    amount: BigDecimal
)

Sample values:

ExpenseEntry("John", "candy", 0.5)
ExpenseEntry("Tia", "game", 0.25)
ExpenseEntry("John", "candy", 0.15)
ExpenseEntry("Tia", "candy", 0.55)

The expected answer is:

category - name - amount
candy - John - 0.65
candy - Tia - 0.55
game - Tia - 0.25

What I want to do is get the total amount spent on each category by each name. So, I have the following Dataset query:

dataset.groupBy("category", "name").agg(sum("amount"))

In theory, this looks correct to me. However, the sums are displayed as 0E-18, i.e. as zero. My guess is that the amount is being converted to an int inside the sum function. How can I cast it to BigInt? Is my understanding of the problem correct?
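
One way to check is to inspect the schemas directly; here is a minimal sketch, assuming the Dataset built from the case class above and Spark's default encoders:

import org.apache.spark.sql.functions.sum

// amount is inferred as decimal(38,18) for a Scala BigDecimal,
// so the input column is a decimal, not an int
dataset.printSchema()

// sum over a decimal column stays a decimal; 0E-18 is just
// BigDecimal scientific notation for zero at scale 18, not an integer zero
dataset.groupBy("category", "name").agg(sum("amount")).printSchema()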

package spark

import org.apache.spark.sql.{DataFrame, SparkSession}

object SumBig extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  case class ExpenseEntry(
      name: String,
      category: String,
      amount: BigDecimal
  )
  val df = Seq(
    ExpenseEntry("John", "candy", 0.5),
    ExpenseEntry("Tia", "game", 0.25),
    ExpenseEntry("John", "candy", 0.15),
    ExpenseEntry("Tia", "candy", 0.55)
  ).toDF()

  df.show(false)

  val r = df.groupBy("category", "name").sum("amount")
  r.show(false)

//      +--------+----+--------------------+
//      |category|name|sum(amount)         |
//      +--------+----+--------------------+
//      |game    |Tia |0.250000000000000000|
//      |candy   |John|0.650000000000000000|
//      |candy   |Tia |0.550000000000000000|
//      +--------+----+--------------------+

}

1. You can use bround() to limit the number of decimal places, as in the snippet below.
2. sum does not change the column's data type from decimal to int.

df.groupBy("category", "name")
  .agg(sum(bround(col("amount"), 2)).as("sum_amount"))
  .show()
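
If you would rather keep full precision during the aggregation and only trim the scale afterwards, an alternative sketch is to sum first and then cast the result (the sum_amount alias is just illustrative):

import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.types.DecimalType

// sum at full decimal precision, then reduce the displayed scale to 2
df.groupBy("category", "name")
  .agg(sum(col("amount")).cast(DecimalType(38, 2)).as("sum_amount"))
  .show(false)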