By Groups in SAS vs window functions in SQL

Question

I have a table, input_table, with the following columns:

key - double
code - string
date - string
result - string

I have the following SAS code:

PROC SQL;
    CREATE TABLE t1 AS
    SELECT key, code, date, result
    from input_table
    ORDER BY key, code, date;
QUIT;


DATA t1;
   SET t1;
   /* Convert result to numeric; non-numeric values become missing */
   final_result = INPUT(result, 5.);
RUN;


DATA t2;
   SET t1;
   WHERE NOT MISSING(final_result);
   BY key code date;

   IF LAST.code;
RUN;

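As I understand it, this code adds a column 'final_result', which is result converted to a numeric value, or NULL (missing) if result contains non-numeric characters. It then selects the row with the latest date for each key, code pair. I am trying to replicate this in HiveQL (which I think is nearly identical to SQL for this case):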

select key, code, date, result, final_result 
from
(select *, 
 row_number() over (partition by key, code order by date desc) as rnk, 
 cast(result as double) as final_result
 from input_table
) x
where rnk=1 and final_result is not null

Is this query equivalent to the SAS code above? (I would test it myself, but I am having environment problems at the moment.)

Answer 1

The only major issue I see is that final_result on the last row of a group may be null/missing.

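In SAS, the WHERE statement is applied before the DATA step processes the rows, so it is effectively equivalent to putting the filter inside the partition clause (I'm not sure that is possible) and/or in a preceding step. If the last row by date happens to be missing, SAS skips it and takes the last row that is not missing (because the missing row never enters the data stream to begin with).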

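In your SQL, if the rnk=1 row happens to have final_result is null, it is dropped, but the rnk=2 row (or any later row) is not kept in its place, so your output has no row at all for that particular key/code combination.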

select key, code, date, result, final_result 
from
(select s.*, 
 row_number() over (partition by key, code order by date desc) as rnk
 from (
   select *, cast(result as double) as final_Result
   from input_table 
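   -- repeat the cast here: a column alias cannot be referenced in the WHERE of the same select
   where cast(result as double) is not null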
   ) s
) x
where rnk=1

Something like that should be equivalent.
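To make the order-of-operations point concrete, here is a minimal Python sketch (not SAS or Hive) that runs both orderings on a tiny, made-up dataset. The sample rows and the helper names `to_number` and `last_per_group` are assumptions for illustration only; `to_number` is a rough stand-in for the numeric conversion, not an exact model of the SAS `5.` informat or of Hive's `cast`.

```python
from itertools import groupby

# Hypothetical sample rows: (key, code, date, result). Not from the original post.
rows = [
    (1.0, "A", "2020-01-01", "10"),
    (1.0, "A", "2020-01-02", "abc"),  # latest date in its group, but non-numeric
    (2.0, "B", "2020-01-01", "7"),
]


def to_number(text):
    """Return float(text), or None when text is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def last_per_group(records):
    """Sort by key, code, date and keep the last row of each (key, code) group."""
    ordered = sorted(records, key=lambda r: (r[0], r[1], r[2]))
    return [list(group)[-1]
            for _, group in groupby(ordered, key=lambda r: (r[0], r[1]))]


# Append the converted value as a fifth field, final_result.
converted = [r + (to_number(r[3]),) for r in rows]

# SAS-like order: drop missing final_result first, then take the last row per group.
filter_then_pick = last_per_group([r for r in converted if r[4] is not None])

# Order used by the query in the question: take the last row per group, then filter.
pick_then_filter = [r for r in last_per_group(converted) if r[4] is not None]

print(filter_then_pick)
print(pick_then_filter)
```

With this data, filtering first keeps a row for both groups, while ranking first loses the (1.0, "A") group entirely, which is the difference described above.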

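The only other potential issue I can see: if you have two rows with exactly the same date, SAS will pick the "last" row in input dataset order. I don't know what Hive will do; in most SQL implementations, however, you should assume you will get an arbitrary one of them, because SQL makes no attempt to preserve row order.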
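One way to remove that ambiguity is to break ties on an explicit column instead of relying on row order. The Python sketch below illustrates the idea under an assumption: the data carries a sequence column, here called `seq`, which is hypothetical and does not exist in the table from the question. In SQL terms this would correspond to adding a second column to the window's order by.

```python
# Hypothetical rows: (key, code, date, result, seq).
# `seq` is an assumed column recording input order; it is not in the original table.
rows = [
    (1.0, "A", "2020-01-02", "5", 1),
    (1.0, "A", "2020-01-02", "9", 2),  # same date as the row above
    (1.0, "A", "2020-01-01", "3", 3),
]


def pick_last(records):
    """Return, per (key, code), the row with the greatest (date, seq) pair."""
    best = {}
    for r in records:
        group = (r[0], r[1])
        if group not in best or (r[2], r[4]) > (best[group][2], best[group][4]):
            best[group] = r
    return best


winner = pick_last(rows)[(1.0, "A")]
# The same row wins even when the rows arrive in a different physical order.
shuffled_winner = pick_last(list(reversed(rows)))[(1.0, "A")]

print(winner)
print(shuffled_winner)
```

Because the comparison uses the pair (date, seq), the outcome no longer depends on the order in which the engine happens to deliver the rows.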

