Exclude consecutive duplicate records in Redshift
I have a simple SQL problem that I can't solve (I'm using Amazon Redshift).
Say I have the following sample data:
id, type, channel, date, column1, column2, column3, column4
1, visit, seo, 07/08/2017: 11:11:22
1, hit, seo, 07/08/2017: 11:12:34
1, hit, seo, 07/08/2017: 11:13:22
1, visit, sem, 07/08/2017: 11:15:11
1, scarf, display, 07/08/2017: 11:15:45
1, hit, display, 07/08/2017: 11:15:37
1, hit, seo, 07/08/2017: 11:18:22
1, hit, display, 07/08/2017: 11:18:23
1, hit, referal, 07/08/2017: 11:19:55
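I want to select all visits (in my table's logic, a visit marks the start of each group of rows for a given ID) and also exclude rows whose 'channel' repeats the one immediately before it. For my example, the query should return: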
1, visit, seo, 07/08/2017: 11:11:22
**1, hit, seo, 07/08/2017: 11:12:34** (exclude because it follows seo and it's not a visit)
**1, hit, seo, 07/08/2017: 11:13:22** (exclude because it follows seo and it's not a visit)
1, visit, sem, 07/08/2017: 11:15:11 (include, new channel)
1, scarf, display, 07/08/2017: 11:15:45 (include, new channel)
**1, hit, display, 07/08/2017: 11:15:37** (exclude because it follows display and it's not a visit)
1, hit, seo, 07/08/2017: 11:18:22 (include because it doesn't follow seo directly, even if seo is already present)
1, hit, display, 07/08/2017: 11:18:23 (include because it doesn't follow display directly, even if display is already present)
1, hit, referal, 07/08/2017: 11:19:55 (include, new channel)
I tried using a row number (since I'm on Redshift):
select type, date, id, ROW_NUMBER() OVER (PARTITION BY id, channel ORDER BY date) as rn
and then adding a filter:
Where type='visit' or rn=1
But that doesn't solve the problem, because it doesn't return rows 7 and 8:
1, hit, seo, 07/08/2017: 11:18:22 (will be rn=4 for 'id=1, channel=seo' combination)
1, hit, display, 07/08/2017: 11:18:23 (will be rn=3 for 'id=1, channel=display' combination)
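Can anyone give me a pointer so I can solve this?
You can use lag.
Select only the rows where the previous channel is different, the type is visit, or there is no previous row: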
select * from (
  select *,
         -- channel of the previous row for the same id, in date order
         lag(channel) over (partition by id order by date) as prev_channel
  from mytable
) t
where prev_channel <> channel or type = 'visit' or prev_channel is null
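To check the lag approach without a Redshift cluster, here is a minimal sketch that runs the same query against an in-memory SQLite database from Python. It assumes an SQLite build with window function support (3.25 or later), and it rewrites the sample timestamps as ISO strings so they sort correctly as text; reading 07/08/2017 as 7 August is a guess. The `run_query` helper and the `ROWS` and `QUERY` names are only illustrative.

```python
import sqlite3

# Sample rows from the question. Timestamps are rewritten as ISO strings
# (an assumption) so that ORDER BY on a text column sorts chronologically.
ROWS = [
    (1, "visit", "seo", "2017-08-07 11:11:22"),
    (1, "hit", "seo", "2017-08-07 11:12:34"),
    (1, "hit", "seo", "2017-08-07 11:13:22"),
    (1, "visit", "sem", "2017-08-07 11:15:11"),
    (1, "scarf", "display", "2017-08-07 11:15:45"),
    (1, "hit", "display", "2017-08-07 11:15:37"),
    (1, "hit", "seo", "2017-08-07 11:18:22"),
    (1, "hit", "display", "2017-08-07 11:18:23"),
    (1, "hit", "referal", "2017-08-07 11:19:55"),
]

# Same logic as the answer: keep a row when it is the first for its id,
# when its channel differs from the previous row's, or when it is a visit.
QUERY = """
SELECT id, type, channel, date
FROM (
    SELECT *,
           LAG(channel) OVER (PARTITION BY id ORDER BY date) AS prev_channel
    FROM mytable
) t
WHERE prev_channel <> channel OR type = 'visit' OR prev_channel IS NULL
ORDER BY id, date
"""


def run_query(rows):
    """Load rows into an in-memory table and return the filtered result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mytable (id INTEGER, type TEXT, channel TEXT, date TEXT)"
    )
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


kept = run_query(ROWS)
for row in kept:
    print(row)
```

Six rows come back, including the later seo and display hits that the ROW_NUMBER attempt dropped. One thing to watch: in the sample data the two consecutive display rows are listed out of timestamp order (11:15:45 before 11:15:37), so when ordering by date it is the 11:15:37 row that survives as the first display row, not the 11:15:45 one.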