.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) error in Spark Scala
Hi, I'm trying to propagate the last value of each window to the rest of that window in a count column, so that I can build a flag identifying whether a record is the last one in its window. I tried it this way, but it didn't work.
Sample DF:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df_197 = Seq[(Int, Int, Int, Int)](
  (1,1,7,10), (1,10,4,300), (1,3,14,50), (1,20,24,70), (1,30,12,90),
  (2,10,4,900), (2,25,30,40), (2,15,21,60), (2,5,10,80)
).toDF("policyId", "FECMVTO", "aux", "IND_DEF")
 .orderBy(asc("policyId"), asc("FECMVTO"))
df_197.show
+--------+-------+---+-------+
|policyId|FECMVTO|aux|IND_DEF|
+--------+-------+---+-------+
| 1| 1| 7| 10|
| 1| 3| 14| 50|
| 1| 10| 4| 300|
| 1| 20| 24| 70|
| 1| 30| 12| 90|
| 2| 5| 10| 80|
| 2| 10| 4| 900|
| 2| 15| 21| 60|
| 2| 25| 30| 40|
+--------+-------+---+-------+
val juntar_riesgo = 1
val var_entidad_2 = $"aux"
// Partition by one or two fields depending on the value of juntar_riesgo
// window_number_2 will be created based on this partitioning
val winSpec = if(juntar_riesgo == 1) {
Window.partitionBy($"policyId").orderBy($"FECMVTO")
} else {
Window.partitionBy(var_entidad_2,$"policyId").orderBy("FECMVTO")
}
val df_308 = df_197.withColumn("window_number", row_number().over(winSpec))
  .withColumn("count", last("window_number", true).over(winSpec))
  .withColumn("FLG_LAST_WDW", when(col("window_number") === col("count"), 1).otherwise(lit(0))).show
Result (the count column needs to be 5 for every row of the first partition and 4 for every row of the second):
+--------+-------+---+-------+-------------+-----+------------+
|policyId|FECMVTO|aux|IND_DEF|window_number|count|FLG_LAST_WDW|
+--------+-------+---+-------+-------------+-----+------------+
| 1| 1| 7| 10| 1| 1| 1|
| 1| 3| 14| 50| 2| 2| 1|
| 1| 10| 4| 300| 3| 3| 1|
| 1| 20| 24| 70| 4| 4| 1|
| 1| 30| 12| 90| 5| 5| 1|
| 2| 5| 10| 80| 1| 1| 1|
| 2| 10| 4| 900| 2| 2| 1|
| 2| 15| 21| 60| 3| 3| 1|
| 2| 25| 30| 40| 4| 4| 1|
+--------+-------+---+-------+-------------+-----+------------+
Then I read that when you use orderBy after the window partition clause, you have to specify .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) to achieve what I need. However, when I tried it, I got this error:
val juntar_riesgo = 1
val var_entidad_2 = $"aux"
// Partition by one or two fields depending on the value of juntar_riesgo
// window_number_2 will be created based on this partitioning
val winSpec = if(juntar_riesgo == 1) {
Window.partitionBy($"policyId").orderBy($"FECMVTO")
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
} else {
Window.partitionBy(var_entidad_2,$"policyId").orderBy("FECMVTO")
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
}
val df_198 = df_197.withColumn("window_number", row_number().over(winSpec))
  .withColumn("count", last("window_number", true).over(winSpec))
  .withColumn("FLG_LAST_WDW", when(col("window_number") === col("count"), 1).otherwise(lit(0))).show
ERROR: org.apache.spark.sql.AnalysisException: Window Frame specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$()) must match the required frame specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$());
Thanks for your help!
You should not use last here, but max, without specifying an ordering:
val df_198 = df_197
  .withColumn("window_number", row_number().over(Window.partitionBy($"policyId").orderBy($"FECMVTO")))
  .withColumn("count", max("window_number").over(Window.partitionBy($"policyId")))
  .withColumn("FLG_LAST_WDW", when(col("window_number") === col("count"), 1).otherwise(lit(0))).show
+--------+-------+---+-------+-------------+-----+------------+
|policyId|FECMVTO|aux|IND_DEF|window_number|count|FLG_LAST_WDW|
+--------+-------+---+-------+-------------+-----+------------+
| 1| 1| 7| 10| 1| 5| 0|
| 1| 3| 14| 50| 2| 5| 0|
| 1| 10| 4| 300| 3| 5| 0|
| 1| 20| 24| 70| 4| 5| 0|
| 1| 30| 12| 90| 5| 5| 1|
| 2| 5| 10| 80| 1| 4| 0|
| 2| 10| 4| 900| 2| 4| 0|
| 2| 15| 21| 60| 3| 4| 0|
| 2| 25| 30| 40| 4| 4| 1|
+--------+-------+---+-------+-------------+-----+------------+
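For context, the AnalysisException comes from row_number(): it requires the frame (unboundedPreceding, currentRow), so applying .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) to the very spec that row_number() runs over violates that requirement. If you do want to keep last, a minimal sketch is to use two separate window specs, one per function (orderedSpec, fullFrameSpec, and df_alt are illustrative names, not from the original answer):

// Illustrative alternative sketch, not the answer's method.
// Default frame, as required by row_number()
val orderedSpec = Window.partitionBy($"policyId").orderBy($"FECMVTO")
// Same partitioning and ordering, but spanning the whole partition for last()
val fullFrameSpec = orderedSpec.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val df_alt = df_197
  .withColumn("window_number", row_number().over(orderedSpec))
  .withColumn("count", last("window_number", true).over(fullFrameSpec))
  .withColumn("FLG_LAST_WDW", when($"window_number" === $"count", 1).otherwise(0))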
Note that you can write this more concisely by computing row_number in descending order and then taking row_number === 1:
val df_198 = df_197
  .withColumn("FLG_LAST_WDW", when(row_number().over(Window.partitionBy($"policyId").orderBy($"FECMVTO".desc)) === 1, 1).otherwise(0))
  .show
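Tying this back to the juntar_riesgo branch from the question, a minimal sketch of the same one-liner with a configurable partition (partCols and df_flagged are illustrative names):

// Illustrative sketch: pick the partition columns as in the original winSpec
val partCols = if (juntar_riesgo == 1) Seq($"policyId") else Seq(var_entidad_2, $"policyId")

val df_flagged = df_197.withColumn(
  "FLG_LAST_WDW",
  when(row_number().over(Window.partitionBy(partCols: _*).orderBy($"FECMVTO".desc)) === 1, 1).otherwise(0)
)
df_flagged.show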