特定列的百分位数
Percentile over a specific column
我有以下数据框。
scala> df.show
+---+------+---+
| M|Amount| Id|
+---+------+---+
| 1| 5| 1|
| 1| 10| 2|
| 1| 15| 3|
| 1| 20| 4|
| 1| 25| 5|
| 1| 30| 6|
| 2| 2| 1|
| 2| 4| 2|
| 2| 6| 3|
| 2| 8| 4|
| 2| 10| 5|
| 2| 12| 6|
| 3| 1| 1|
| 3| 2| 2|
| 3| 3| 3|
| 3| 4| 4|
| 3| 5| 5|
| 3| 6| 6|
+---+------+---+
创建者
val df=Seq( (1,5,1), (1,10,2), (1,15,3), (1,20,4), (1,25,5), (1,30,6), (2,2,1), (2,4,2), (2,6,3), (2,8,4), (2,10,5), (2,12,6), (3,1,1), (3,2,2), (3,3,3), (3,4,4), (3,5,5), (3,6,6) ).toDF("M","Amount","Id")
这里我有一个基本列 M 并且根据 Amount 排名为 ID .
我正在尝试计算保持 M 作为一个组的百分位数,但对于 Amount.
的每最后三个值
我正在使用 来查找一组的百分位数。但是我如何定位最后三个值。 ?
df.withColumn("percentile",percentile_approx(col("Amount") ,lit(.5)) over Window.partitionBy("M"))
预期输出
+---+------+---+-----------------------------------+
| M|Amount| Id| percentile |
+---+------+---+-----------------------------------+
| 1| 5| 1| percentile(Amount) whose (Id-1) |
| 1| 10| 2| percentile(Amount) whose (Id-1,2) |
| 1| 15| 3| percentile(Amount) whose (Id-1,3) |
| 1| 20| 4| percentile(Amount) whose (Id-2,4) |
| 1| 25| 5| percentile(Amount) whose (Id-3,5) |
| 1| 30| 6| percentile(Amount) whose (Id-4,6) |
| 2| 2| 1| percentile(Amount) whose (Id-1) |
| 2| 4| 2| percentile(Amount) whose (Id-1,2) |
| 2| 6| 3| percentile(Amount) whose (Id-1,3) |
| 2| 8| 4| percentile(Amount) whose (Id-2,4) |
| 2| 10| 5| percentile(Amount) whose (Id-3,5) |
| 2| 12| 6| percentile(Amount) whose (Id-4,6) |
| 3| 1| 1| percentile(Amount) whose (Id-1) |
| 3| 2| 2| percentile(Amount) whose (Id-1,2) |
| 3| 3| 3| percentile(Amount) whose (Id-1,3) |
| 3| 4| 4| percentile(Amount) whose (Id-2,4) |
| 3| 5| 5| percentile(Amount) whose (Id-3,5) |
| 3| 6| 6| percentile(Amount) whose (Id-4,6) |
+---+------+---+----------------------------------+
这对我来说似乎有点棘手,因为我仍在学习 spark.Expecting 这里的爱好者的答案。
将 orderBy("Amount")
和 rowsBetween(-2,0)
添加到 Window 定义得到所需的结果:
- orderBy 按数量
对每个组中的行进行排序
- rowsBetween计算百分位数时只考虑当前行和前两行
val w = Window.partitionBy("M").orderBy("Amount").rowsBetween(-2,0)
df.withColumn("percentile",PercentileApprox.percentile_approx(col("Amount") ,lit(.5))
.over(w))
.orderBy("M", "Amount") //not really required, just to make the output more readable
.show()
打印
+---+------+---+----------+
| M|Amount| Id|percentile|
+---+------+---+----------+
| 1| 5| 1| 5|
| 1| 10| 2| 5|
| 1| 15| 3| 10|
| 1| 20| 4| 15|
| 1| 25| 5| 20|
| 1| 30| 6| 25|
| 2| 2| 1| 2|
| 2| 4| 2| 2|
| 2| 6| 3| 4|
| 2| 8| 4| 6|
| 2| 10| 5| 8|
| 2| 12| 6| 10|
| 3| 1| 1| 1|
| 3| 2| 2| 1|
| 3| 3| 3| 2|
| 3| 4| 4| 3|
| 3| 5| 5| 4|
| 3| 6| 6| 5|
+---+------+---+----------+
我有以下数据框。
scala> df.show
+---+------+---+
| M|Amount| Id|
+---+------+---+
| 1| 5| 1|
| 1| 10| 2|
| 1| 15| 3|
| 1| 20| 4|
| 1| 25| 5|
| 1| 30| 6|
| 2| 2| 1|
| 2| 4| 2|
| 2| 6| 3|
| 2| 8| 4|
| 2| 10| 5|
| 2| 12| 6|
| 3| 1| 1|
| 3| 2| 2|
| 3| 3| 3|
| 3| 4| 4|
| 3| 5| 5|
| 3| 6| 6|
+---+------+---+
创建者
val df=Seq( (1,5,1), (1,10,2), (1,15,3), (1,20,4), (1,25,5), (1,30,6), (2,2,1), (2,4,2), (2,6,3), (2,8,4), (2,10,5), (2,12,6), (3,1,1), (3,2,2), (3,3,3), (3,4,4), (3,5,5), (3,6,6) ).toDF("M","Amount","Id")
这里我有一个基本列 M 并且根据 Amount 排名为 ID . 我正在尝试计算保持 M 作为一个组的百分位数,但对于 Amount.
的每最后三个值我正在使用
df.withColumn("percentile",percentile_approx(col("Amount") ,lit(.5)) over Window.partitionBy("M"))
预期输出
+---+------+---+-----------------------------------+
| M|Amount| Id| percentile |
+---+------+---+-----------------------------------+
| 1| 5| 1| percentile(Amount) whose (Id-1) |
| 1| 10| 2| percentile(Amount) whose (Id-1,2) |
| 1| 15| 3| percentile(Amount) whose (Id-1,3) |
| 1| 20| 4| percentile(Amount) whose (Id-2,4) |
| 1| 25| 5| percentile(Amount) whose (Id-3,5) |
| 1| 30| 6| percentile(Amount) whose (Id-4,6) |
| 2| 2| 1| percentile(Amount) whose (Id-1) |
| 2| 4| 2| percentile(Amount) whose (Id-1,2) |
| 2| 6| 3| percentile(Amount) whose (Id-1,3) |
| 2| 8| 4| percentile(Amount) whose (Id-2,4) |
| 2| 10| 5| percentile(Amount) whose (Id-3,5) |
| 2| 12| 6| percentile(Amount) whose (Id-4,6) |
| 3| 1| 1| percentile(Amount) whose (Id-1) |
| 3| 2| 2| percentile(Amount) whose (Id-1,2) |
| 3| 3| 3| percentile(Amount) whose (Id-1,3) |
| 3| 4| 4| percentile(Amount) whose (Id-2,4) |
| 3| 5| 5| percentile(Amount) whose (Id-3,5) |
| 3| 6| 6| percentile(Amount) whose (Id-4,6) |
+---+------+---+----------------------------------+
这对我来说似乎有点棘手,因为我仍在学习 spark.Expecting 这里的爱好者的答案。
将 orderBy("Amount")
和 rowsBetween(-2,0)
添加到 Window 定义得到所需的结果:
- orderBy 按数量 对每个组中的行进行排序
- rowsBetween计算百分位数时只考虑当前行和前两行
val w = Window.partitionBy("M").orderBy("Amount").rowsBetween(-2,0)
df.withColumn("percentile",PercentileApprox.percentile_approx(col("Amount") ,lit(.5))
.over(w))
.orderBy("M", "Amount") //not really required, just to make the output more readable
.show()
打印
+---+------+---+----------+
| M|Amount| Id|percentile|
+---+------+---+----------+
| 1| 5| 1| 5|
| 1| 10| 2| 5|
| 1| 15| 3| 10|
| 1| 20| 4| 15|
| 1| 25| 5| 20|
| 1| 30| 6| 25|
| 2| 2| 1| 2|
| 2| 4| 2| 2|
| 2| 6| 3| 4|
| 2| 8| 4| 6|
| 2| 10| 5| 8|
| 2| 12| 6| 10|
| 3| 1| 1| 1|
| 3| 2| 2| 1|
| 3| 3| 3| 2|
| 3| 4| 4| 3|
| 3| 5| 5| 4|
| 3| 6| 6| 5|
+---+------+---+----------+