How to deduplicate in Presto
I have a Presto table. Assume it has the columns [id, name, update_time] and the following data:
(1, Amy, 2018-08-01),
(1, Amy, 2018-08-02),
(1, Amyyyyyyy, 2018-08-03),
(2, Bob, 2018-08-01)
Now I want to run a SQL query whose result is:
(1, Amyyyyyyy, 2018-08-03),
(2, Bob, 2018-08-01)
Currently, the best way I have found to deduplicate in Presto is the following:
select
t1.id,
t1.name,
t1.update_time
from table_name t1
join (select id, max(update_time) as update_time from table_name group by id) t2
on t1.id = t2.id and t1.update_time = t2.update_time
For more information, see deduplication in sql.
Is there a better way to deduplicate in Presto?
You seem to want a correlated subquery:
select t.*
from table_name t
where t.update_time = (select max(t1.update_time) from table_name t1 where t1.id = t.id);
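It's simple. Group by id and use max_by to take the name from the row with the latest update_time:
select id, max_by(name, update_time) as name, max(update_time) as update_time from table_name group by id
Hope this helps.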
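Just use the in operator:
select t.*
from tableA t
where t.update_time in (select max(update_time) from tableA group by id)
Note that this matches on update_time alone, so an older row whose update_time equals another id's maximum is also returned.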
In PrestoDB, I tend to use row_number():
select id, name, update_time
from (select t.*,
row_number() over (partition by id order by update_time desc) as seqnum
from table_name t
) t
where seqnum = 1;
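To try the row_number() approach without a real table, the question's sample rows can be inlined with VALUES. This is only a sketch: the CTE name sample_rows and the alias ranked are made up for illustration, and update_time is assumed to be a DATE column.

```sql
-- Sample data from the question, inlined so the query is self-contained.
-- "sample_rows" is a hypothetical name; replace it with your table.
WITH sample_rows AS (
    SELECT *
    FROM (VALUES
        (1, 'Amy',       DATE '2018-08-01'),
        (1, 'Amy',       DATE '2018-08-02'),
        (1, 'Amyyyyyyy', DATE '2018-08-03'),
        (2, 'Bob',       DATE '2018-08-01')
    ) AS v (id, name, update_time)
)
SELECT id, name, update_time
FROM (
    SELECT s.*,
           -- Number the rows of each id, newest first.
           row_number() OVER (PARTITION BY id ORDER BY update_time DESC) AS seqnum
    FROM sample_rows s
) ranked
WHERE seqnum = 1   -- keep only the newest row per id
ORDER BY id;
-- Result:
-- (1, Amyyyyyyy, 2018-08-03)
-- (2, Bob, 2018-08-01)
```

If two rows of the same id share the latest update_time, row_number() still keeps exactly one of them, but which one is arbitrary unless you add a tie-breaking column to the ORDER BY.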
Here is another way:
WITH latest AS (SELECT id, max(update_time) AS latest_update FROM table_name GROUP BY id)
SELECT t.id, t.name, t.update_time FROM table_name t INNER JOIN latest l ON t.id = l.id AND t.update_time = l.latest_update
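One thing to watch with the join on max(update_time): it keeps every row that ties for the latest timestamp. The sketch below uses made-up data (not from the question) in which id 1 has two rows on the same date; tied_rows and latest are hypothetical names.

```sql
-- Hypothetical data: id 1 has two rows with the same latest update_time.
WITH tied_rows AS (
    SELECT *
    FROM (VALUES
        (1, 'Amy',       DATE '2018-08-03'),
        (1, 'Amyyyyyyy', DATE '2018-08-03'),
        (2, 'Bob',       DATE '2018-08-01')
    ) AS v (id, name, update_time)
),
latest AS (
    -- Latest timestamp per id.
    SELECT id, max(update_time) AS latest_update
    FROM tied_rows
    GROUP BY id
)
SELECT t.id, t.name, t.update_time
FROM tied_rows t
JOIN latest l
  ON t.id = l.id AND t.update_time = l.latest_update
ORDER BY t.id, t.name;
-- Returns 3 rows: both rows for id 1 survive because they tie.
```

If exactly one row per id is required, a window function with a tie-breaker in its ORDER BY is probably the safer choice.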