How to find quantiles inside the agg() function after groupBy in Scala Spark
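I have a DataFrame that I want to group by column A and then compute several statistics for, such as mean, min, max, standard deviation, and quantiles.
I can find the min, max, and mean with the following code: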
df.groupBy("A").agg(min("B"), max("B"), mean("B")).show(50, false)
But I can't work out how to get the quantiles (0.25, 0.5, 0.75). I tried approxQuantile and percentile, but I get the following error:
error: not found: value approxQuantile
If you have Hive on the classpath, you can use many UDAFs such as percentile_approx and stddev_samp; see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
You can call these functions using callUDF:
// Assumes `ss` is your SparkSession
import ss.implicits._
import org.apache.spark.sql.functions.{callUDF, lit}

val df = Seq(1.0, 2.0, 3.0).toDF("x")
df.groupBy()
  .agg(
    callUDF("percentile_approx", $"x", lit(0.5)).as("median"),
    callUDF("stddev_samp", $"x").as("stdev")
  )
  .show()
Here is code that I tested on Spark 3.1:
// Assumes `spark` is the active SparkSession (as in spark-shell)
import spark.implicits._
import org.apache.spark.sql.functions.{lit, percentile_approx}

val simpleData = Seq(
  ("James", "Sales", "NY", 90000, 34, 10000),
  ("Michael", "Sales", "NY", 86000, 56, 20000),
  ("Robert", "Sales", "CA", 81000, 30, 23000),
  ("Maria", "Finance", "CA", 90000, 24, 23000),
  ("Raman", "Finance", "CA", 99000, 40, 24000),
  ("Scott", "Finance", "NY", 83000, 36, 19000),
  ("Jen", "Finance", "NY", 79000, 53, 15000),
  ("Jeff", "Marketing", "CA", 80000, 25, 18000),
  ("Kumar", "Marketing", "NY", 91000, 50, 21000)
)
val df = simpleData.toDF("employee_name", "department", "state", "salary", "age", "bonus")
df.show()

// Approximate median salary per department (third argument is the accuracy)
df.groupBy($"department")
  .agg(
    percentile_approx($"salary", lit(0.5), lit(10000))
  )
  .show(false)
Output:
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|        James|     Sales|   NY| 90000| 34|10000|
|      Michael|     Sales|   NY| 86000| 56|20000|
|       Robert|     Sales|   CA| 81000| 30|23000|
|        Maria|   Finance|   CA| 90000| 24|23000|
|        Raman|   Finance|   CA| 99000| 40|24000|
|        Scott|   Finance|   NY| 83000| 36|19000|
|          Jen|   Finance|   NY| 79000| 53|15000|
|         Jeff| Marketing|   CA| 80000| 25|18000|
|        Kumar| Marketing|   NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+
+----------+-------------------------------------+
|department|percentile_approx(salary, 0.5, 10000)|
+----------+-------------------------------------+
|Sales     |86000                                |
|Finance   |83000                                |
|Marketing |80000                                |
+----------+-------------------------------------+
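The question asks for three quantiles (0.25, 0.5, 0.75) per group, while the code above computes only the median. A minimal sketch of one way to get all three in a single agg() call, assuming Spark 3.1 or later (where percentile_approx is available in org.apache.spark.sql.functions) and passing the percentages as an array column. The object name GroupedQuantiles, the helper quantilesByDepartment, and the output column names q25/q50/q75 are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, lit, percentile_approx, stddev_samp}

// Hypothetical helper object, not part of Spark.
object GroupedQuantiles {

  // One row per department with approximate 0.25/0.5/0.75 salary quantiles
  // and the sample standard deviation.
  def quantilesByDepartment(df: DataFrame): DataFrame = {
    // An array of percentages makes percentile_approx return an array column.
    val probs = array(lit(0.25), lit(0.5), lit(0.75))

    df.groupBy("department")
      .agg(
        percentile_approx(col("salary"), probs, lit(10000)).as("q"),
        stddev_samp(col("salary")).as("stdev")
      )
      .select(
        col("department"),
        col("q").getItem(0).as("q25"), // array elements follow the order of probs
        col("q").getItem(1).as("q50"),
        col("q").getItem(2).as("q75"),
        col("stdev")
      )
  }

  def main(args: Array[String]): Unit = {
    // Local session for a standalone run; in spark-shell reuse `spark` instead.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("grouped-quantiles")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("Sales", 90000), ("Sales", 86000), ("Sales", 81000),
      ("Finance", 90000), ("Finance", 99000), ("Finance", 83000), ("Finance", 79000),
      ("Marketing", 80000), ("Marketing", 91000)
    ).toDF("department", "salary")

    quantilesByDepartment(df).show(false)
    spark.stop()
  }
}
```

As for the error in the question: approxQuantile is a method on df.stat (DataFrameStatFunctions) that returns plain arrays to the driver. It is not a Column function in org.apache.spark.sql.functions, so it cannot be used inside agg(), which is consistent with the "not found: value approxQuantile" message.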