Remove duplicates based on multiple columns and datetime
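I want to remove duplicate rows that share the same visitor_id, keeping only the row with the earliest datetime. For example, for visitor_id 2643331144 I want to keep row 1 because it has the earlier visit datetime, and I want the channel and visit_page from that same row. For visitor_id 1092581226, I want to keep row 3.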
rowno | visitor_id | datetime | channel | visit_page |
---|---|---|---|---|
1 | 2643331144 | 10/3/2021 4:05:29 PM | email | landing page |
2 | 2643331144 | 10/3/2021 4:05:39 PM | organic search | landing page |
3 | 1092581226 | 10/7/2021 1:08:12 PM | email | price reduced |
4 | 1092581226 | 10/7/2021 1:08:44 PM | organic search | landing page |
5 | 1092581226 | 10/7/2021 1:09:04 PM | paid search | unknow |
6 | 1092581226 | 10/7/2021 1:09:05 PM | email | price reduced |
I would like a result like this:
rowno | visitor_id | datetime | channel | visit_page |
---|---|---|---|---|
1 | 2643331144 | 10/3/2021 4:05:29 PM | email | landing page |
2 | 1092581226 | 10/7/2021 1:08:12 PM | email | price reduced |
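I used the query below, but it removes too many rows, so the total number of visitors comes out too low. Without the partition, however, the total is double-counted, because the same visitor has multiple channels and pages within a single session.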
with T as
(select *, row_number() over (partition by visitor_id order by datetime asc) as rank
from table A)
select distinct visitor_id, channel, visit_page
from T
where rank=1
If the only issue is the rownum in the final output, you can "recalculate" it by using row_number() over (order by datetime asc) as rownum in the final select:
with cte (
    visitor_id
    ,datetime
    ,channel
    ,visit_page
) as (
    values
    (2643331144,'10/3/2021 4:05:29 PM','email','landing page'),
    (2643331144,'10/3/2021 4:05:39 PM','organic search','landing page'),
    (1092581226,'10/7/2021 1:08:12 PM','email','price reduced'),
    (1092581226,'10/7/2021 1:08:44 PM','organic search','landing page'),
    (1092581226,'10/7/2021 1:09:04 PM','paid search','unknow'),
    (1092581226,'10/7/2021 1:09:05 PM','email','price reduced')
)
-- renumber the surviving rows so rownum runs 1, 2, ... in the final output
select row_number() over (order by datetime asc) as rownum,
    visitor_id,
    datetime,
    channel,
    visit_page
from (
    -- your WITH clause, inlined as a subquery
    select *,
        row_number() over (
            partition by visitor_id
            order by datetime asc
        ) as rank
    from cte
) as ranked -- alias added because some engines require one on a subquery in FROM
where rank = 1
Output:
rownum | visitor_id | datetime | channel | visit_page |
---|---|---|---|---|
1 | 2643331144 | 10/3/2021 4:05:29 PM | email | landing page |
2 | 1092581226 | 10/7/2021 1:08:12 PM | email | price reduced |
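One caveat worth checking: in the sample CTE the datetime values are string literals, so order by datetime compares them as text. That happens to give the right order for these six rows, but as text '10/10/2021 ...' sorts before '10/3/2021 ...'. If your real column is also stored as text, it is probably safer to convert it to a proper timestamp (or an ISO 8601 string) before ranking. Below is a minimal sketch of that idea using Python's built-in sqlite3 module against the sample data. The table name visits, the column visited_at, the alias rn, and the helpers to_iso and earliest_visits are all made up for illustration, and it assumes a SQLite build with window function support (3.25 or later).

```python
import sqlite3
from datetime import datetime

# Sample rows from the question: (visitor_id, datetime, channel, visit_page)
ROWS = [
    (2643331144, "10/3/2021 4:05:29 PM", "email", "landing page"),
    (2643331144, "10/3/2021 4:05:39 PM", "organic search", "landing page"),
    (1092581226, "10/7/2021 1:08:12 PM", "email", "price reduced"),
    (1092581226, "10/7/2021 1:08:44 PM", "organic search", "landing page"),
    (1092581226, "10/7/2021 1:09:04 PM", "paid search", "unknow"),
    (1092581226, "10/7/2021 1:09:05 PM", "email", "price reduced"),
]


def to_iso(text):
    """Hypothetical helper: turn 'M/D/YYYY h:mm:ss AM/PM' into
    'YYYY-MM-DD HH:MM:SS', whose text order matches chronological order."""
    return datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p").isoformat(sep=" ")


# Same shape as the query above: rank rows per visitor by time, keep the
# first one, then renumber the survivors. 'visits', 'visited_at' and 'rn'
# are illustrative names, not taken from the original table.
DEDUP_SQL = """
select row_number() over (order by visited_at) as rownum,
       visitor_id, visited_at, channel, visit_page
from (
    select *,
           row_number() over (
               partition by visitor_id
               order by visited_at
           ) as rn
    from visits
) as ranked
where rn = 1
order by rownum
"""


def earliest_visits(rows):
    """Load rows into an in-memory SQLite table and return one row per
    visitor_id: the visit with the earliest timestamp."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table visits ("
        "visitor_id integer, visited_at text, channel text, visit_page text)"
    )
    conn.executemany(
        "insert into visits values (?, ?, ?, ?)",
        [(v, to_iso(d), c, p) for v, d, c, p in rows],
    )
    result = conn.execute(DEDUP_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in earliest_visits(ROWS):
        print(row)
```

On an engine with a native timestamp type you would more likely cast or parse the column inside the query itself; the Python conversion here is only a way to keep the sketch self-contained.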