Hive HQL - optimizing repetitive WINDOW clause
I have the following HQL:
SELECT count(*) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) pocet,
min(event.time) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) minTime,
max(event.time) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) maxTime
FROM t21_pam6
How can I define the three identical WINDOW clauses just once?
The documentation (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics) shows this example:
SELECT a, SUM(b) OVER w
FROM T;
WINDOW w AS (PARTITION BY c ORDER BY d ROWS UNBOUNDED PRECEDING)
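But I don't think it works. I can't define WINDOW w, because ... is not an HQL command.
This kind of optimization is something the compiler needs to do. I don't think there is a way to ensure it programmatically.
That said, the calculation of the minimum time is totally unnecessary. Because of the ORDER BY and a frame that starts at the current row, the minimum is always the time of the current row. Similarly, if you can handle NULL values, the expression can be simplified to: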
SELECT count(*) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) pocet,
event.time as minTime,
lead(event.time, 2) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time) as maxTime
FROM t21_pam6;
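Note that the maxTime calculation is slightly different: lead(event.time, 2) returns NULL for the last two rows of each partition, where no row exists two positions ahead.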
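To see the difference concretely, here is a minimal sketch that runs both forms side by side. It uses Python's sqlite3 module as a stand-in for Hive (an assumption: SQLite 3.25 or later, which supports window functions), and the table `events`, its flat columns, and the sample rows are hypothetical. Hive's behavior on struct fields such as event.time should be analogous, but this is not a Hive test.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for Hive in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (hwid TEXT, domain TEXT, event_time INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("hw1", "a.com", 10), ("hw1", "a.com", 20),
     ("hw1", "a.com", 30), ("hw1", "a.com", 40)],
)

query = """
SELECT event_time,
       min(event_time) OVER (PARTITION BY hwid, domain ORDER BY event_time
                             ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS min_time,
       max(event_time) OVER (PARTITION BY hwid, domain ORDER BY event_time
                             ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS max_time,
       lead(event_time, 2) OVER (PARTITION BY hwid, domain
                                 ORDER BY event_time) AS lead_time
FROM events
ORDER BY event_time
"""
result = conn.execute(query).fetchall()

# Columns: event_time, min over frame, max over frame, lead by 2.
for row in result:
    print(row)
```

In this sample, min_time equals event_time on every row, while lead_time matches max_time except on the last two rows of the partition, where lead_time is NULL (None in Python) and max_time is the last available time.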
As @sergey-khudyakov responded, there is a bug in the documentation. This variant works fine:
SELECT count(*) OVER w,
min(event.time) OVER w,
max(event.time) OVER w
FROM ar3.t21_pam6
WINDOW w AS (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
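As a quick sanity check that a named window is only shorthand for repeating the same definition, the sketch below compares both spellings. It again uses Python's sqlite3 module in place of Hive (an assumption: a SQLite build recent enough to support the WINDOW clause), with a hypothetical `events` table and made-up rows.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for Hive in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (hwid TEXT, domain TEXT, event_time INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("hw1", "a.com", 10), ("hw1", "a.com", 20),
     ("hw1", "a.com", 30), ("hw2", "b.com", 5)],
)

# The window definition repeated for every function.
inline_sql = """
SELECT hwid, event_time,
       count(*) OVER (PARTITION BY hwid, domain ORDER BY event_time
                      ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS pocet,
       min(event_time) OVER (PARTITION BY hwid, domain ORDER BY event_time
                             ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS min_time,
       max(event_time) OVER (PARTITION BY hwid, domain ORDER BY event_time
                             ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS max_time
FROM events
ORDER BY hwid, event_time
"""

# The same query with the window defined once, after FROM.
named_sql = """
SELECT hwid, event_time,
       count(*) OVER w AS pocet,
       min(event_time) OVER w AS min_time,
       max(event_time) OVER w AS max_time
FROM events
WINDOW w AS (PARTITION BY hwid, domain ORDER BY event_time
             ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
ORDER BY hwid, event_time
"""

inline_rows = conn.execute(inline_sql).fetchall()
named_rows = conn.execute(named_sql).fetchall()
print(inline_rows == named_rows)
```

Note the placement: the WINDOW clause belongs to the SELECT statement and follows FROM, so a semicolon after the FROM line, as in the documentation example, ends the statement before the window is defined.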