Using max() in a WHERE clause in Presto SQL

I have the following table.

ID  Desc  progress     updated_time
1   abcd  planned      2022-04-20 10:00AM
1   abcd  planned      2022-04-25 12:00AM
1   abcd  in progress  2022-04-26 4:00PM
1   abcd  in progress  2022-05-04 11:00AM
1   abcd  in progress  2022-05-06 12:00PM

I want to return only the row with the latest updated_time, regardless of its progress, i.e.:

ID  Desc  progress     updated_time
1   abcd  in progress  2022-05-06 12:00PM

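I know that if I group by 'progress' (as in the query below), I will also get a 'planned' row, which I don't need. I only need one row per ID, the one with its latest updated_time.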

I wrote the following query:

select ID,desc,progress,updated_time 
from t1 
where updated_time IN (select ID, desc, progress, max(updated_time) 
from t1 group by 1,2,3)

But I get the following error: 'Multiple columns returned by subquery are not yet supported'

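You are trying to match a single value (updated_time) against a subquery that returns multiple columns, and this raises the error.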

Looking at your code and what you are trying to achieve, you should use an inner join instead of an IN clause based on a subquery:

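select t1.ID, t1.desc, t1.progress, t1.updated_time
from t1
INNER JOIN
( select ID, max(updated_time) max_time
  -- group by ID only: grouping by progress too would also return the 'planned' row
  from t1 group by 1 ) t
  -- match on ID as well, so a timestamp shared with another ID does not leak in
  on t.ID = t1.ID and t.max_time = t1.updated_time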

I would probably use row_number or some other ranking function for this.

with t as (
    select a.*,
        -- number each ID's rows from newest to oldest
        row_number() over (partition by id order by updated_time desc) as rn
    from t1 a
)
select * from t where rn = 1
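
For what it's worth, here is a rough sketch of the "other ranking function" variant. rank() gives tied rows the same number, so filtering on 1 keeps every row that shares the latest updated_time, whereas row_number() keeps just one of them, chosen arbitrarily. The sample rows (including the 'blocked' status) and the alias rnk are made up for illustration and are not from the question.

```sql
-- hypothetical sample data: ID 1 has two rows sharing the latest timestamp
WITH dataset (ID, Desc, progress, updated_time) AS (
    VALUES
    (1, 'abcd', 'planned',     timestamp '2022-04-20 10:00'),
    (1, 'abcd', 'in progress', timestamp '2022-05-06 12:00'),
    (1, 'abcd', 'blocked',     timestamp '2022-05-06 12:00')
)
select id, Desc, progress, updated_time
from (
    select *,
        -- tied timestamps get the same rank, so both latest rows survive the filter
        rank() over (partition by id order by updated_time desc) as rnk
    from dataset
)
where rnk = 1
```

If exactly one row per ID is required, row_number() with an extra tie-breaking column in the order by is the safer choice.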

Selecting multiple columns in the subquery will not work; the subquery has to return a single value, i.e. be a scalar subquery:

-- sample data
WITH dataset (ID, Desc, progress, updated_time) AS (
    VALUES 
(1, 'abcd', 'planned',  timestamp '2022-04-20 10:00'),
(1, 'abcd', 'planned',  timestamp '2022-04-25 12:00'),
(1, 'abcd', 'in progress',  timestamp '2022-04-26 16:00'),
(1, 'abcd', 'in progress',  timestamp '2022-05-04 11:00'),
(1, 'abcd', 'in progress',  timestamp '2022-05-06 12:00'),
(1, 'abcd', 'in progress',  timestamp '2022-05-07 12:00'),
(2, 'abcd', 'in progress',  timestamp '2022-05-04 11:00'),
(2, 'abcd', 'in progress',  timestamp '2022-05-06 12:00')
) 

--query
select  id, Desc, progress, updated_time
from dataset o
where updated_time = (select max(updated_time) from dataset i where i.id = o.id)

Or a similar approach using the max window function and a subselect:

--query
select  id, Desc, progress, updated_time
from (
    select *,  max(updated_time) over (partition by id) max_time
    from dataset
)
where max_time = updated_time

Or just use row_number:

select  id, Desc, progress, updated_time
from 
(
    select *,  
        row_number() over(partition by id order by updated_time desc) rank
    from dataset
)
where rank  = 1
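
Another option, offered only as a sketch: Presto has a max_by(x, y) aggregate that returns the value of x from the row where y is largest, which avoids the subquery altogether. This assumes max_by is available in your Presto version. Note also that if two rows of the same ID share the latest updated_time, the row it picks is arbitrary. The data below is a subset of the sample rows used above.

```sql
-- sample data (subset of the rows used above)
WITH dataset (ID, Desc, progress, updated_time) AS (
    VALUES
    (1, 'abcd', 'planned',     timestamp '2022-04-20 10:00'),
    (1, 'abcd', 'in progress', timestamp '2022-05-07 12:00'),
    (2, 'abcd', 'in progress', timestamp '2022-05-04 11:00'),
    (2, 'abcd', 'in progress', timestamp '2022-05-06 12:00')
)
select id,
    -- take each column's value from the row with the latest updated_time
    max_by(Desc, updated_time) as Desc,
    max_by(progress, updated_time) as progress,
    max(updated_time) as updated_time
from dataset
group by id
```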

Output:

id  Desc  progress     updated_time
1   abcd  in progress  2022-05-07 12:00:00.000
2   abcd  in progress  2022-05-06 12:00:00.000