Spark Hive - UDFArgumentTypeException with window function?

Question

I have the following df:

+------------+----------------------+-------------------+                                 
|increment_id|base_subtotal_incl_tax|          eventdate|                                 
+------------+----------------------+-------------------+                                 
|        1086|            14470.0000|2016-06-14 09:54:12|                                 
|        1086|            14470.0000|2016-06-14 09:54:12|                                 
|        1086|            14470.0000|2015-07-14 09:54:12|                                 
|        1086|            14470.0000|2015-07-14 09:54:12|                                 
|        1086|            14470.0000|2015-07-14 09:54:12|                                 
|        1086|            14470.0000|2015-07-14 09:54:12|                                 
|        1086|             1570.0000|2015-07-14 09:54:12|                                 
|        5555|            14470.0000|2014-07-14 09:54:12|                                 
|        5555|            14470.0000|2014-07-14 09:54:12|                                 
|        5555|            14470.0000|2014-07-14 09:54:12|                                 
|        5555|            14470.0000|2014-07-14 09:54:12|                                 
+------------+----------------------+-------------------+

I am trying to run a window function as follows:

WindowSpec window = Window.partitionBy(df.col("id")).orderBy(df.col("eventdate").desc());
df.select(df.col("*"),rank().over(window).alias("rank")) //error for this line
         .filter("rank <= 2")
         .show();

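What I want is the last two entries per user (last meaning the most recent dates; since the window is ordered descending, those are the first two rows):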

+------------+----------------------+-------------------+                                 
|increment_id|base_subtotal_incl_tax|          eventdate|                                 
+------------+----------------------+-------------------+                                 
|        1086|            14470.0000|2016-06-14 09:54:12|                                 
|        1086|            14470.0000|2016-06-14 09:54:12|   
|        5555|            14470.0000|2014-07-14 09:54:12|                                 
|        5555|            14470.0000|2014-07-14 09:54:12|                                     
+------------+----------------------+-------------------+

But I get:

+------------+----------------------+-------------------+----+
|increment_id|base_subtotal_incl_tax|          eventdate|rank|                            
+------------+----------------------+-------------------+----+                            
|        5555|            14470.0000|2014-07-14 09:54:12|   1|                            
|        5555|            14470.0000|2014-07-14 09:54:12|   1|                            
|        5555|            14470.0000|2014-07-14 09:54:12|   1|                            
|        5555|            14470.0000|2014-07-14 09:54:12|   1|                            
|        1086|            14470.0000|2016-06-14 09:54:12|   1|                            
|        1086|            14470.0000|2016-06-14 09:54:12|   1|                            
+------------+----------------------+-------------------+----+

What am I missing?

[OLD] - Originally I had an error, which is now solved:

WindowSpec window = Window.partitionBy(df.col("id"));
df.select(df.col("*"),rank().over(window).alias("rank")) //error for this line
         .filter("rank <= 2")
         .show();

But this returns an error for the line marked with a comment above: Exception in thread "main" org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: One or more arguments are expected. What am I missing? What does this error mean? Thanks!

Answer 1

The rank window function needs a window with an orderBy clause, for example:

WindowSpec window = Window.partitionBy(df.col("id")).orderBy(df.col("payment"));

Without an ordering, rank has nothing to rank by, so it is meaningless; hence the error.
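
To make the role of the ordering concrete, here is a minimal plain-Java sketch of what a rank computes inside a single partition. It is an illustration only, not Spark's or Hive's implementation: the class RankSketch and the method rankDescending are hypothetical names, and the input is the eventdate column for increment_id 1086 from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of Spark or Hive.
class RankSketch {

    // Sorts the order keys in descending order and returns the rank of each
    // sorted row. Tied keys share a rank, and the next distinct key takes its
    // 1-based row position (1, 1, 3, ...), as SQL RANK() does.
    static List<Integer> rankDescending(List<String> orderKeys) {
        List<String> sorted = new ArrayList<>(orderKeys);
        sorted.sort(Comparator.reverseOrder());
        List<Integer> ranks = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            if (i > 0 && sorted.get(i).equals(sorted.get(i - 1))) {
                ranks.add(ranks.get(i - 1)); // tie: reuse the previous rank
            } else {
                ranks.add(i + 1);            // new key: rank = row position
            }
        }
        return ranks;
    }

    public static void main(String[] args) {
        // eventdate values for increment_id 1086, copied from the question
        List<String> dates = Arrays.asList(
            "2016-06-14 09:54:12", "2016-06-14 09:54:12",
            "2015-07-14 09:54:12", "2015-07-14 09:54:12",
            "2015-07-14 09:54:12", "2015-07-14 09:54:12",
            "2015-07-14 09:54:12");
        System.out.println(rankDescending(dates)); // prints [1, 1, 3, 3, 3, 3, 3]
    }
}
```

The rank is derived entirely from the sort key, which is why a window with no orderBy gives it nothing to work with. The sketch also suggests why the filter "rank <= 2" keeps every row tied at rank 1, as in the output shown in the question. If exactly two rows per partition are wanted regardless of ties, a row-number window function (row_number in recent Spark versions) may be a better fit than rank.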

Tags: java, hive, window-functions, apache-spark, apache-spark-sql