SQL (Hive): using window functions while aggregating with GROUP BY
I have the following table in Athena (Hive/Presto):
CREATE EXTERNAL TABLE tmp (
    id STRING,
    updated_at TIMESTAMP,
    location STRING,
    direction STRING
)
LOCATION 's3://path';
I need to aggregate and count on the id field while also selecting the location and direction that belong to the latest timestamp within each group (again partitioned on id).
So far I have come up with the following query, which applies the window functions first and then groups:
SELECT
    b.id,
    MAX(b.latest_location) AS "latest_location", -- It seems it is not possible to use first_value() on GROUP BY
    MAX(b.latest_direction) AS "latest_direction",
    COUNT(*) AS "total"
FROM (
    SELECT
        a.id,
        first_value(a.location) OVER (PARTITION BY a.id ORDER BY a.updated_at DESC) AS "latest_location",
        first_value(a.direction) OVER (PARTITION BY a.id ORDER BY a.updated_at DESC) AS "latest_direction"
    FROM tmp a
) b
GROUP BY b.id;
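I first tried to do the grouped aggregation and the window aggregation in the same SELECT, but the engine does not seem to allow that. Is it possible to write a more efficient query (perhaps without a subquery)?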
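You can mix window functions and aggregate functions, but only in the other direction: aggregate first, then apply the window function.
That said, your query should be much faster if you eliminate the aggregation. Just use row_number() and filter: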
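SELECT a.id, a.location, a.direction, a.updated_at, a.total
FROM (SELECT t.*,
             -- rank the rows of each id, newest first
             ROW_NUMBER() OVER (PARTITION BY t.id ORDER BY t.updated_at DESC) AS seqnum,
             -- rows per id, replacing COUNT(*) ... GROUP BY
             COUNT(*) OVER (PARTITION BY t.id) AS total
      FROM tmp t
     ) a
WHERE seqnum = 1;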
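To make the two points above concrete, here is a small sketch that runs both patterns against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for Athena here, the sample rows are made up, and it assumes an SQLite build with window function support (3.25 or later), so treat it as an illustration rather than a benchmark.

```python
import sqlite3

# Hypothetical sample rows; SQLite stands in for Athena here (an assumption).
ROWS = [
    ("a", "2020-01-01 10:00:00", "paris", "north"),
    ("a", "2020-01-02 10:00:00", "rome", "south"),
    ("a", "2020-01-03 10:00:00", "berlin", "east"),
    ("b", "2020-01-01 09:00:00", "oslo", "west"),
    ("b", "2020-01-05 09:00:00", "lima", "north"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tmp (id TEXT, updated_at TEXT, location TEXT, direction TEXT)")
con.executemany("INSERT INTO tmp VALUES (?, ?, ?, ?)", ROWS)

# Aggregate first, then window: RANK() is evaluated over the grouped rows.
agg_then_window = sorted(con.execute("""
    SELECT id,
           COUNT(*) AS total,
           RANK() OVER (ORDER BY COUNT(*) DESC) AS rank_by_total
    FROM tmp
    GROUP BY id
""").fetchall())

# No GROUP BY: number the rows per id (newest first) and keep only row 1.
latest_per_id = sorted(con.execute("""
    SELECT id, location, direction, total
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS seqnum,
                 COUNT(*) OVER (PARTITION BY id) AS total
          FROM tmp t) ranked
    WHERE seqnum = 1
""").fetchall())

print(agg_then_window)
print(latest_per_id)
```

The second query returns one row per id, carrying the location and direction of its newest row plus the per-id row count, which is what the question asks for.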
SELECT DISTINCT
    id,
    first_value(location) OVER (PARTITION BY id ORDER BY updated_at DESC) AS latest_location,
    first_value(direction) OVER (PARTITION BY id ORDER BY updated_at DESC) AS latest_direction,
    count(*) OVER (PARTITION BY id) AS total
FROM tmp;
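In your original query, max is essentially a dummy aggregate, because every row within a group already carries the same value. The group by is essentially doing what distinct does here.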
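As a quick check of that equivalence, the sketch below runs the GROUP BY version and the DISTINCT version on the same made-up rows and compares the results. It uses an in-memory SQLite database via Python's sqlite3 as a stand-in for Athena (an assumption, as are the sample rows), so it shows that the two queries agree on this data, not how they perform.

```python
import sqlite3

# Hypothetical sample rows; SQLite stands in for Athena here (an assumption).
ROWS = [
    ("a", "2020-01-01 10:00:00", "paris", "north"),
    ("a", "2020-01-02 10:00:00", "rome", "south"),
    ("a", "2020-01-03 10:00:00", "berlin", "east"),
    ("b", "2020-01-01 09:00:00", "oslo", "west"),
    ("b", "2020-01-05 09:00:00", "lima", "north"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tmp (id TEXT, updated_at TEXT, location TEXT, direction TEXT)")
con.executemany("INSERT INTO tmp VALUES (?, ?, ?, ?)", ROWS)

# Original shape: window functions in a subquery, then MAX() as a dummy aggregate.
group_by_rows = sorted(con.execute("""
    SELECT b.id,
           MAX(b.latest_location) AS latest_location,
           MAX(b.latest_direction) AS latest_direction,
           COUNT(*) AS total
    FROM (SELECT a.id,
                 first_value(a.location) OVER (PARTITION BY a.id ORDER BY a.updated_at DESC) AS latest_location,
                 first_value(a.direction) OVER (PARTITION BY a.id ORDER BY a.updated_at DESC) AS latest_direction
          FROM tmp a) b
    GROUP BY b.id
""").fetchall())

# DISTINCT shape: every row of an id yields the same tuple, so DISTINCT collapses them.
distinct_rows = sorted(con.execute("""
    SELECT DISTINCT id,
           first_value(location) OVER (PARTITION BY id ORDER BY updated_at DESC) AS latest_location,
           first_value(direction) OVER (PARTITION BY id ORDER BY updated_at DESC) AS latest_direction,
           COUNT(*) OVER (PARTITION BY id) AS total
    FROM tmp
""").fetchall())

print(group_by_rows == distinct_rows)
```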
Adding to the accepted answer: consider giving your window definition a name, in keeping with DRY (Don't Repeat Yourself):
SELECT DISTINCT
    id,
    first_value(location) OVER w AS latest_location,
    first_value(direction) OVER w AS latest_direction,
    count(*) OVER (PARTITION BY id) AS total
FROM tmp
WINDOW w AS (PARTITION BY id ORDER BY updated_at DESC);
This keeps a more complex window definition in exactly one place and guarantees that the same window logic is used for both column calculations.
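One detail worth seeing in action is why the count keeps its own window instead of reusing w. The sketch below is an illustration on made-up rows, run in an in-memory SQLite database through Python's sqlite3 as a stand-in for Athena (an assumption; it needs an SQLite build that supports the WINDOW clause). Because w has an ORDER BY, the default frame ends at the current row, so count(*) OVER w turns into a running count and DISTINCT no longer collapses each id to one row.

```python
import sqlite3

# Hypothetical sample rows; SQLite stands in for Athena here (an assumption).
ROWS = [
    ("a", "2020-01-01 10:00:00", "paris", "north"),
    ("a", "2020-01-02 10:00:00", "rome", "south"),
    ("a", "2020-01-03 10:00:00", "berlin", "east"),
    ("b", "2020-01-01 09:00:00", "oslo", "west"),
    ("b", "2020-01-05 09:00:00", "lima", "north"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tmp (id TEXT, updated_at TEXT, location TEXT, direction TEXT)")
con.executemany("INSERT INTO tmp VALUES (?, ?, ?, ?)", ROWS)

# Named window for the "latest" columns, separate unordered window for the count.
named_window_rows = sorted(con.execute("""
    SELECT DISTINCT id,
           first_value(location) OVER w AS latest_location,
           first_value(direction) OVER w AS latest_direction,
           COUNT(*) OVER (PARTITION BY id) AS total
    FROM tmp
    WINDOW w AS (PARTITION BY id ORDER BY updated_at DESC)
""").fetchall())

# Pitfall: reusing the ordered window for COUNT(*) gives a running count per row.
running_count_rows = sorted(con.execute("""
    SELECT DISTINCT id,
           first_value(location) OVER w AS latest_location,
           first_value(direction) OVER w AS latest_direction,
           COUNT(*) OVER w AS total
    FROM tmp
    WINDOW w AS (PARTITION BY id ORDER BY updated_at DESC)
""").fetchall())

print(len(named_window_rows), len(running_count_rows))
```

On this data the first query yields one row per id, while the second yields one row per input row, each with a different running total.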