Group overlapping ranges of data in MySQL

Is there a simple way, without using cursors, to convert this:

+-------+------+-------+
| Group | From | Until |
+-------+------+-------+
| X     | 1    | 3     |
+-------+------+-------+
| X     | 2    | 4     |
+-------+------+-------+
| Y     | 5    | 7     |
+-------+------+-------+
| X     | 8    | 10    |
+-------+------+-------+
| Y     | 11   | 12    |
+-------+------+-------+
| Y     | 12   | 13    |
+-------+------+-------+

into this:

+-------+------+-------+
| Group | From | Until |
+-------+------+-------+
| X     | 1    | 4     |
+-------+------+-------+
| Y     | 5    | 7     |
+-------+------+-------+
| X     | 8    | 10    |
+-------+------+-------+
| Y     | 11   | 13    |
+-------+------+-------+

So far I have tried assigning an ID to each row and grouping by that ID, but I can't get any further without using cursors.
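For anyone who wants to reproduce the sample data, here is a minimal setup sketch. The table name Table_lag is the one some of the answers below use (others use test or mytable), and the column types are assumptions, since the question does not state them.

```sql
-- Assumed schema: the question gives no column types, so VARCHAR/INT are guesses.
CREATE TABLE Table_lag (
    `Group` VARCHAR(10) NOT NULL,
    `From`  INT NOT NULL,
    `Until` INT NOT NULL
);

-- The six rows from the first table in the question.
INSERT INTO Table_lag (`Group`, `From`, `Until`) VALUES
    ('X', 1, 3),
    ('X', 2, 4),
    ('Y', 5, 7),
    ('X', 8, 10),
    ('Y', 11, 12),
    ('Y', 12, 13);
```

The backticks are needed because Group and From are reserved words in MySQL.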

If you are using MySQL version 8+, you can use ROW_NUMBER() to get the desired result:

Demo

-- RN_FROM - RN_UNTIL stays constant across a run of consecutive rows
-- (in `FROM` order) that belong to the same group, so it identifies each run.
-- Note: this groups consecutive rows; it does not itself test for overlap.
SELECT MIN(`FROM`) START,
       MAX(`UNTIL`) END,
       `GROUP`
FROM (
    SELECT A.*,
           ROW_NUMBER() OVER(ORDER BY `FROM`) RN_FROM,
           ROW_NUMBER() OVER(PARTITION BY `GROUP` ORDER BY `UNTIL`) RN_UNTIL
    FROM Table_lag A) X
GROUP BY `GROUP`, (RN_FROM - RN_UNTIL)
ORDER BY START;

-- t3: starts of the merged ranges, numbered per group
-- t4: ends of the merged ranges, numbered per group
SELECT `Group`, `From`, `Until`
FROM ( SELECT `Group`, `From`, ROW_NUMBER() OVER (PARTITION BY `Group` ORDER BY `From`) rn
       FROM test t1
       WHERE NOT EXISTS ( SELECT NULL
                          FROM test t2
                          WHERE t1.`From` > t2.`From`
                            AND t1.`From` <= t2.`Until`
                            AND t1.`Group` = t2.`Group` ) ) t3
JOIN ( SELECT `Group`, `Until`, ROW_NUMBER() OVER (PARTITION BY `Group` ORDER BY `From`) rn
       FROM test t1
       WHERE NOT EXISTS ( SELECT NULL
                          FROM test t2
                          WHERE t1.`Until` >= t2.`From`
                            AND t1.`Until` < t2.`Until`
                            AND t1.`Group` = t2.`Group` ) ) t4 USING (`Group`, rn);

fiddle

Should work for any type of overlap (partially overlapping, adjacent, fully contained).

Will not work if From and/or Until is NULL.



Could you add an explanation in English? – ysth

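The first subquery finds the starts of the merged ranges (see the fiddle, where it is executed on its own): within each group it picks the From values that do not fall in the middle or at the end of any other range (equal starts are allowed).

The second subquery does the same for the Until values of the merged ranges.

Both additionally number the values they find.

The outer query simply joins the start and end of each range into a single row.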
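To see the first step in isolation, the "range starts" subquery can be run on its own. The sketch below assumes MySQL 8+ and replaces the table test with an inline CTE of the same name holding the question's six rows, so no table has to exist; that inline CTE is an addition for illustration only.

```sql
-- Inline copy of the question's sample rows, standing in for the table `test`.
WITH test (`Group`, `From`, `Until`) AS (
    SELECT 'X', 1, 3 UNION ALL
    SELECT 'X', 2, 4 UNION ALL
    SELECT 'Y', 5, 7 UNION ALL
    SELECT 'X', 8, 10 UNION ALL
    SELECT 'Y', 11, 12 UNION ALL
    SELECT 'Y', 12, 13
)
-- Keep only rows whose From is not strictly after another range's From
-- while still inside that range: these are the starts of merged ranges.
SELECT `Group`, `From`,
       ROW_NUMBER() OVER (PARTITION BY `Group` ORDER BY `From`) AS rn
FROM test t1
WHERE NOT EXISTS ( SELECT NULL
                   FROM test t2
                   WHERE t1.`From` > t2.`From`
                     AND t1.`From` <= t2.`Until`
                     AND t1.`Group` = t2.`Group` )
ORDER BY `Group`, `From`;
-- For this data: X/1 (rn 1), X/8 (rn 2), Y/5 (rn 1), Y/11 (rn 2).
```

The rn values are what the outer join uses to pair the n-th start of a group with the n-th end of the same group.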

I would use a recursive CTE for this:

with recursive intervals (`Group`, `From`, `Until`) as (
    select distinct t1.Group, t1.From, t1.Until
    from Table_lag t1
    where not exists (
        select 1
        from Table_lag t2
        where t1.Group=t2.Group
        and t1.From between t2.From and t2.Until+1
        and (t1.From,t1.Until) <> (t2.From,t2.Until)
    )
    union all
    select t1.Group, t1.From, t2.Until
    from intervals t1
    join Table_lag t2
        on t2.Group=t1.Group
        and t2.From between t1.From and t1.Until+1
        and t2.Until > t1.Until
)
select `Group`, `From`, max(`Until`) as Until
from intervals
group by `Group`, `From`
order by `From`, `Group`;

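The anchor expression (select .. where not exists (...)) finds every Group/From pair that will not be merged into some earlier range, so it yields one row for each row of the final output.

The recursive query then adds rows for the intervals merged into each of those rows.

Finally, grouping by Group and From (those awful column names) gives the largest Until for each starting Group/From.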
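To make the recursion easier to follow, it can help to look at the rows the CTE produces before the final max()/group by. This is a sketch, assuming MySQL 8+; the inline Table_lag CTE is a stand-in for the real table and holds the question's sample rows.

```sql
WITH RECURSIVE Table_lag (`Group`, `From`, `Until`) AS (
    -- Inline sample rows (illustration only; replaces the real table).
    SELECT 'X', 1, 3 UNION ALL
    SELECT 'X', 2, 4 UNION ALL
    SELECT 'Y', 5, 7 UNION ALL
    SELECT 'X', 8, 10 UNION ALL
    SELECT 'Y', 11, 12 UNION ALL
    SELECT 'Y', 12, 13
),
intervals (`Group`, `From`, `Until`) AS (
    -- Anchor: ranges that do not start inside (or right after) another range.
    SELECT DISTINCT t1.`Group`, t1.`From`, t1.`Until`
    FROM Table_lag t1
    WHERE NOT EXISTS (
        SELECT 1
        FROM Table_lag t2
        WHERE t1.`Group` = t2.`Group`
          AND t1.`From` BETWEEN t2.`From` AND t2.`Until` + 1
          AND (t1.`From`, t1.`Until`) <> (t2.`From`, t2.`Until`)
    )
    UNION ALL
    -- Recursive step: extend an interval with any range that starts
    -- inside it (or right after it) and ends later.
    SELECT t1.`Group`, t1.`From`, t2.`Until`
    FROM intervals t1
    JOIN Table_lag t2
        ON t2.`Group` = t1.`Group`
       AND t2.`From` BETWEEN t1.`From` AND t1.`Until` + 1
       AND t2.`Until` > t1.`Until`
)
SELECT `Group`, `From`, `Until`
FROM intervals
ORDER BY `Group`, `From`, `Until`;
-- For this data: (X,1,3) (X,1,4) (X,8,10) (Y,5,7) (Y,11,12) (Y,11,13)
```

Each start appears once per extension, which is why the final query keeps only max(`Until`) per Group/From.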

https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=9efa508504b80e44b73c952572394b76

Alternatively, you can do this with a simple set of joins and subqueries, with no need for CTEs or window functions:

select
    interval_start_range.grp,
    interval_start_range.start,
    max(merged.finish) finish
from (
    select
        interval_start.grp,
        interval_start.start,
        min(later_interval_start.start) next_start
    from (
        select distinct t1.grp, t1.start, t1.finish
        from Table_lag t1
        where not exists (
            select 1
            from Table_lag t2
            where t1.grp=t2.grp
            and t1.start between t2.start and t2.finish+1
            and (t1.start,t1.finish) <> (t2.start,t2.finish)
        )
    ) interval_start
    left join (
        select distinct t1.grp, t1.start, t1.finish
        from Table_lag t1
        where not exists (
            select 1
            from Table_lag t2
            where t1.grp=t2.grp
            and t1.start between t2.start and t2.finish+1
            and (t1.start,t1.finish) <> (t2.start,t2.finish)
        )
    ) later_interval_start
        on interval_start.grp=later_interval_start.grp
        and interval_start.start < later_interval_start.start
    group by interval_start.grp, interval_start.start
) as interval_start_range
join Table_lag merged
    on merged.grp=interval_start_range.grp
    and merged.start >= interval_start_range.start
    and (interval_start_range.next_start is null or merged.start < interval_start_range.next_start)
group by interval_start_range.grp, interval_start_range.start
order by interval_start_range.start, interval_start_range.grp

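(I have renamed the columns here so that they don't need backticks.)

There is a select here that gets the starts of all the intervals that will be reported, joined to another similar select (you could use a CTE to avoid the redundancy) to find the start of the following reportable interval for the same group, if there is one. That is wrapped in a subquery that returns the group, the start value, and the start value of the following reportable interval. After that it only needs to join all the records that start within that range and pick the maximum end value.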
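If MySQL 8+ is available, the CTE variant mentioned in the explanation might look like the sketch below. It is not the author's code, only a rearrangement of the same logic; the inline Table_lag CTE with sample rows and the short aliases (s, later, r, m) are additions for illustration.

```sql
WITH Table_lag (grp, start, finish) AS (
    -- Inline sample rows (illustration only; replaces the real table).
    SELECT 'X', 1, 3 UNION ALL
    SELECT 'X', 2, 4 UNION ALL
    SELECT 'Y', 5, 7 UNION ALL
    SELECT 'X', 8, 10 UNION ALL
    SELECT 'Y', 11, 12 UNION ALL
    SELECT 'Y', 12, 13
),
interval_start AS (
    -- Starts of reportable intervals, written once instead of twice.
    SELECT DISTINCT t1.grp, t1.start, t1.finish
    FROM Table_lag t1
    WHERE NOT EXISTS (
        SELECT 1
        FROM Table_lag t2
        WHERE t1.grp = t2.grp
          AND t1.start BETWEEN t2.start AND t2.finish + 1
          AND (t1.start, t1.finish) <> (t2.start, t2.finish)
    )
),
interval_start_range AS (
    -- Pair each start with the next start of the same group (NULL if none).
    SELECT s.grp, s.start, MIN(later.start) AS next_start
    FROM interval_start s
    LEFT JOIN interval_start later
        ON later.grp = s.grp
       AND later.start > s.start
    GROUP BY s.grp, s.start
)
SELECT r.grp, r.start, MAX(m.finish) AS finish
FROM interval_start_range r
JOIN Table_lag m
    ON m.grp = r.grp
   AND m.start >= r.start
   AND (r.next_start IS NULL OR m.start < r.next_start)
GROUP BY r.grp, r.start
ORDER BY r.start, r.grp;
-- For this data: (X,1,4) (Y,5,7) (X,8,10) (Y,11,13)
```

This gives up the MySQL 5.5 compatibility of the join-only version in exchange for stating the start-detection logic once.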

https://dbfiddle.uk/?rdbms=mysql_5.5&fiddle=151cc933489c299f7beefa99e1959549

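You can do this with window functions only, using a gaps-and-islands technique.

The idea is to use lag() and a window sum() to build groups of consecutive records that have the same group and overlapping ranges. You can then aggregate those groups: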

select grp, min(c_from) c_from, max(c_until) c_until
from (
    select
        t.*,
        sum(lag_c_until < c_from) over(partition by grp order by c_from) mygrp
    from (
        select
            t.*,
            lag(c_until, 1, c_until) over(partition by grp order by c_from) lag_c_until
        from mytable t
    ) t
) t
group by grp, mygrp

The column names you chose conflict with SQL keywords (group, from), so I renamed them to grp, c_from and c_until.

Demo on DB Fiddle (with credit to ysth for creating the fiddle first):

grp | c_from | c_until
:-- | -----: | ------:
X   |      1 |       4
Y   |      5 |       7
X   |      8 |      10
Y   |     11 |      13
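
One case worth checking with the lag() approach: lag() only looks at the immediately preceding row, so if a range is fully contained in an earlier, longer one, a later row can be compared against the short range and start a new island too early. The question's sample has no such rows, so this is only a possible variant, assuming MySQL 8+: it compares each row against the running maximum of c_until instead. The inline mytable CTE, its rows, and the names prev_max_until, island, merged_from and merged_until are hypothetical.

```sql
WITH mytable (grp, c_from, c_until) AS (
    -- Hypothetical rows: X 2-3 and X 4-5 both sit inside X 1-10.
    SELECT 'X', 1, 10 UNION ALL
    SELECT 'X', 2, 3 UNION ALL
    SELECT 'X', 4, 5 UNION ALL
    SELECT 'Y', 5, 7
)
SELECT grp, MIN(c_from) AS merged_from, MAX(c_until) AS merged_until
FROM (
    SELECT s.*,
           -- Running count of rows that start after every earlier row has ended.
           SUM(prev_max_until < c_from)
               OVER (PARTITION BY grp ORDER BY c_from, c_until) AS island
    FROM (
        SELECT t.*,
               -- Largest c_until among all earlier rows of the group;
               -- the first row of a group falls back to its own c_until.
               COALESCE(
                   MAX(c_until) OVER (PARTITION BY grp ORDER BY c_from, c_until
                                      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
                   c_until) AS prev_max_until
        FROM mytable t
    ) s
) u
GROUP BY grp, island
ORDER BY merged_from, grp;
-- For these rows: (X,1,10) (Y,5,7)
```

With lag(c_until) on the same rows, X 4-5 would be compared with X 2-3 only and reported as a separate range.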