SQL (Presto) - 'compress' rows when date ranges are sequential

Question

I have this data (sample):

event_id    period_start    period_end  rating
100269      2/8/2016        6/30/2016   1
100269      6/30/2016       12/31/2016  1
100269      12/31/2016      6/30/2017   2
100269      6/30/2017       12/31/2017  2

I would like to "compress" the rows when the periods (period_start, period_end) are immediately sequential and the rating is the same. The desired output would be:

event_id    period_start    period_end  rating
100269      2/8/2016        12/31/2016  1
100269      12/31/2016      12/31/2017  2

Note that in this data set, the periods for some event_ids are not all directly sequential. Here is an example and the desired output:

event_id    period_start    period_end  rating
100300      2/8/2016        6/30/2016   1
100300      6/30/2016       12/31/2016  1
100300      6/30/2017       12/31/2017  1

Desired output:

event_id    period_start    period_end  rating
100300      2/8/2016        12/31/2016  1
100300      6/30/2017       12/31/2017  1

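You can determine whether a period is directly sequential by testing whether the previous row's period_end equals the current row's period_start (this holds across the whole data set for identifying directly sequential periods).

I think there is a solution here involving GROUP BY, but I'm not seeing it. Any help would be great. Thanks!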

Answer 1

with a as (
    select *,
        case when lag(period_end) over (partition by event_id, rating order by period_start) = period_start
           then 0 else 1 end as brk
    from T
), b as (
    select *,
        sum(brk) over (partition by event_id, rating order by period_start) as grp
    from a
)
select event_id, min(period_start) as period_start, max(period_end) as period_end, rating
from b
group by event_id, grp, rating
order by event_id, grp, rating

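Determine which rows in the series are breaks and mark them with 1 (the first row of each event_id/rating partition also counts as a break, because lag() returns NULL there). Number the groups by counting the breaks with a running total. Then use group by to collapse each group into a single row.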
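
To see these steps on the question's sample rows, here is a minimal sketch that runs the two CTEs through Python's sqlite3 module and returns brk and grp for every row. SQLite (3.25 or later, for window functions) is only a stand-in for Presto here. The table definition and the ISO-formatted date strings are assumptions, chosen so that the text values sort chronologically.

```python
import sqlite3

# Sample rows from the question. Dates are rewritten as ISO strings
# (an assumption) so that ordering by period_start is chronological.
ROWS = [
    (100269, "2016-02-08", "2016-06-30", 1),
    (100269, "2016-06-30", "2016-12-31", 1),
    (100269, "2016-12-31", "2017-06-30", 2),
    (100269, "2017-06-30", "2017-12-31", 2),
    (100300, "2016-02-08", "2016-06-30", 1),
    (100300, "2016-06-30", "2016-12-31", 1),
    (100300, "2017-06-30", "2017-12-31", 1),
]

# Same CTEs as the answer, but selecting the intermediate columns
# instead of collapsing with group by.
TRACE_SQL = """
with a as (
    select *,
        case when lag(period_end) over (partition by event_id, rating order by period_start) = period_start
           then 0 else 1 end as brk
    from T
), b as (
    select *,
        sum(brk) over (partition by event_id, rating order by period_start) as grp
    from a
)
select event_id, period_start, period_end, rating, brk, grp
from b
order by event_id, period_start
"""


def trace_breaks():
    """Load the sample rows into an in-memory table and return the trace."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table T (event_id integer, period_start text, "
        "period_end text, rating integer)"
    )
    conn.executemany("insert into T values (?, ?, ?, ?)", ROWS)
    rows = conn.execute(TRACE_SQL).fetchall()
    conn.close()
    return rows


trace = trace_breaks()
for row in trace:
    print(row)
```

For event_id 100300 the third row gets brk = 1 and grp = 2, because its period_start (2017-06-30) does not match the previous period_end (2016-12-31). That is why it stays a separate row after the group by.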

Answer 2

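This is a gaps-and-islands problem. The key idea is to use lag() to find where the values change, and then take a cumulative sum to assign groups.

However, I prefer taking the lag of the date column rather than the value column. This turns out to be much more convenient when you have multiple values that might change.

In your case, this looks like: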

select event_id, min(period_start), max(period_end), rating
from (select t.*,
             sum(case when prev_period_end = period_start then 0 else 1 end) over (partition by event_id order by period_start) as grp
      from (select t.*,
                   lag(period_end) over (partition by event_id, rating order by period_start) as prev_period_end
            from t
           ) t
     ) t
group by event_id, rating, grp;
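
As a quick check of the lag-on-date approach, here is a sketch that compares each row's period_start with the previous period_end and runs the nested query against the question's sample rows. It again uses Python's sqlite3 module as a stand-in for Presto. The table definition, the ISO date strings, and the sorting done in Python are assumptions made for the illustration.

```python
import sqlite3

# Sample rows from the question, with ISO dates (an assumption) so that
# text ordering matches chronological ordering.
ROWS = [
    (100269, "2016-02-08", "2016-06-30", 1),
    (100269, "2016-06-30", "2016-12-31", 1),
    (100269, "2016-12-31", "2017-06-30", 2),
    (100269, "2017-06-30", "2017-12-31", 2),
    (100300, "2016-02-08", "2016-06-30", 1),
    (100300, "2016-06-30", "2016-12-31", 1),
    (100300, "2017-06-30", "2017-12-31", 1),
]

# Innermost query: previous period_end within the same event_id and rating.
# Middle query: running count of rows that do not continue the previous one.
# Outer query: one row per (event_id, rating, grp) island.
COLLAPSE_SQL = """
select event_id, min(period_start), max(period_end), rating
from (select t.*,
             sum(case when prev_period_end = period_start then 0 else 1 end)
                 over (partition by event_id order by period_start) as grp
      from (select t.*,
                   lag(period_end) over (partition by event_id, rating order by period_start) as prev_period_end
            from t
           ) t
     ) t
group by event_id, rating, grp
"""


def collapse_periods():
    """Return the collapsed rows, sorted by event_id and period_start."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t (event_id integer, period_start text, "
        "period_end text, rating integer)"
    )
    conn.executemany("insert into t values (?, ?, ?, ?)", ROWS)
    rows = conn.execute(COLLAPSE_SQL).fetchall()
    conn.close()
    return sorted(rows)


collapsed = collapse_periods()
for row in collapsed:
    print(row)
```

On this sample the result has four rows, matching the two desired outputs in the question once the dates are read as ISO strings.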
