SQL (Presto) - 'compress' rows when date ranges are sequential
I have this data (sample):
event_id period_start period_end rating
100269 2/8/2016 6/30/2016 1
100269 6/30/2016 12/31/2016 1
100269 12/31/2016 6/30/2017 2
100269 6/30/2017 12/31/2017 2
I want to "compress" rows when the periods (period_start, period_end) are immediately sequential and the rating is the same. The desired output would be:
event_id period_start period_end rating
100269 2/8/2016 12/31/2016 1
100269 12/31/2016 12/31/2017 2
Note that in this dataset, not all periods are immediately sequential for some event_id values. Here is an example and the desired output:
event_id period_start period_end rating
100300 2/8/2016 6/30/2016 1
100300 6/30/2016 12/31/2016 1
100300 6/30/2017 12/31/2017 1
Desired output:
event_id period_start period_end rating
100300 2/8/2016 12/31/2016 1
100300 6/30/2017 12/31/2017 1
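You can determine whether a period is immediately sequential by testing whether the previous row's period_end equals the current row's period_start (this holds across the entire dataset as the way to identify immediately sequential periods).
I think there's a solution here involving GROUP BY, but I'm not seeing it. Any help would be great. Thanks!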
with a as (
    -- brk = 1 when the row does not continue the previous row's period
    select *,
        case when lag(period_end) over (partition by event_id, rating order by period_start) = period_start
            then 0 else 1 end as brk
    from T
), b as (
    -- running total of breaks numbers each run of sequential rows
    select *,
        sum(brk) over (partition by event_id, rating order by period_start) as grp
    from a
)
select event_id, min(period_start) as period_start, max(period_end) as period_end, rating
from b
group by event_id, grp, rating
order by event_id, grp, rating
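Determine which rows in the series are break rows and mark them with a 1. Number the groups by counting the breaks with a running total. Use group by to collapse each group into a single row.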
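To see what the three steps do row by row, here is a minimal Python sketch that mimics them on the sample data from the question. It is only an illustration, not Presto code: the names `rows` and `compress` are made up for this example, and the dates are assumed to be real date values rather than strings like 2/8/2016, so that ordering by period_start behaves correctly.

```python
from datetime import date
from itertools import groupby

# Sample rows from the question: (event_id, period_start, period_end, rating)
rows = [
    (100269, date(2016, 2, 8), date(2016, 6, 30), 1),
    (100269, date(2016, 6, 30), date(2016, 12, 31), 1),
    (100269, date(2016, 12, 31), date(2017, 6, 30), 2),
    (100269, date(2017, 6, 30), date(2017, 12, 31), 2),
    (100300, date(2016, 2, 8), date(2016, 6, 30), 1),
    (100300, date(2016, 6, 30), date(2016, 12, 31), 1),
    (100300, date(2017, 6, 30), date(2017, 12, 31), 1),
]


def compress(rows):
    """Hypothetical helper: merge touching periods per (event_id, rating)."""
    # Equivalent of "partition by event_id, rating order by period_start".
    ordered = sorted(rows, key=lambda r: (r[0], r[3], r[1]))

    labelled = []
    for _, part in groupby(ordered, key=lambda r: (r[0], r[3])):
        prev_end = None  # plays the role of lag(period_end)
        grp = 0
        for event_id, start, end, rating in part:
            # Step 1: brk is 1 unless this row starts where the previous ended.
            brk = 0 if prev_end == start else 1
            # Step 2: the running total of brk numbers the groups.
            grp += brk
            labelled.append((event_id, rating, grp, start, end))
            prev_end = end

    # Step 3: "group by event_id, grp, rating" with min(start) and max(end).
    islands = {}
    for event_id, rating, grp, start, end in labelled:
        key = (event_id, rating, grp)
        lo, hi = islands.get(key, (start, end))
        islands[key] = (min(lo, start), max(hi, end))

    return [(e, lo, hi, r) for (e, r, _), (lo, hi) in sorted(islands.items())]


result = compress(rows)
for row in result:
    print(row)
```

For event 100300 the third row gets brk = 1 because the previous period ended on 2016-12-31 and it starts on 2017-06-30, so it lands in its own group and is not merged.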
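This is a gaps-and-islands problem. The key idea is to use lag() to find where the value changes and then take a cumulative sum to assign groups.
However, I prefer to take the lag on the date column rather than on the value column. This turns out to be much more convenient when you have multiple values that could change.
In your case, this looks like: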
select event_id, min(period_start), max(period_end), rating
from (select t.*,
             -- a row continues the island when the previous period_end equals its period_start
             sum(case when prev_period_end = period_start then 0 else 1 end) over (partition by event_id order by period_start) as grp
      from (select t.*,
                   lag(period_end) over (partition by event_id, rating order by period_start) as prev_period_end
            from t
           ) t
     ) t
group by event_id, rating, grp;
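As a quick sanity check of the lag-on-date approach, the sketch below runs the query against the question's sample rows using Python's built-in `sqlite3` module. SQLite stands in for Presto here purely for convenience (it needs SQLite 3.25 or later for window functions), and the dates are stored as ISO-8601 strings so they sort chronologically; both are assumptions of this example rather than something from the question. The query compares the previous row's period_end with the current row's period_start, as the question describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (event_id integer, period_start text, "
    "period_end text, rating integer)"
)
# Sample rows from the question, with dates rewritten as ISO-8601 strings.
conn.executemany(
    "insert into t values (?, ?, ?, ?)",
    [
        (100269, "2016-02-08", "2016-06-30", 1),
        (100269, "2016-06-30", "2016-12-31", 1),
        (100269, "2016-12-31", "2017-06-30", 2),
        (100269, "2017-06-30", "2017-12-31", 2),
        (100300, "2016-02-08", "2016-06-30", 1),
        (100300, "2016-06-30", "2016-12-31", 1),
        (100300, "2017-06-30", "2017-12-31", 1),
    ],
)

query = """
select event_id, min(period_start), max(period_end), rating
from (select t.*,
             sum(case when prev_period_end = period_start then 0 else 1 end)
                 over (partition by event_id order by period_start) as grp
      from (select t.*,
                   lag(period_end) over (partition by event_id, rating
                                         order by period_start) as prev_period_end
            from t
           ) t
     ) t
group by event_id, rating, grp
order by event_id, min(period_start)
"""

compressed = conn.execute(query).fetchall()
for row in compressed:
    print(row)
```

On this sample the result has four rows, matching the desired output in the question: two for event 100269 (one per rating) and two for event 100300 (split at the gap between 2016-12-31 and 2017-06-30).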