Group overlapping ranges of data in MySQL
Is there a simple way, without using a cursor, to convert this:
+-------+------+-------+
| Group | From | Until |
+-------+------+-------+
| X | 1 | 3 |
+-------+------+-------+
| X | 2 | 4 |
+-------+------+-------+
| Y | 5 | 7 |
+-------+------+-------+
| X | 8 | 10 |
+-------+------+-------+
| Y | 11 | 12 |
+-------+------+-------+
| Y | 12 | 13 |
+-------+------+-------+
into this:
+-------+------+-------+
| Group | From | Until |
+-------+------+-------+
| X | 1 | 4 |
+-------+------+-------+
| Y | 5 | 7 |
+-------+------+-------+
| X | 8 | 10 |
+-------+------+-------+
| Y | 11 | 13 |
+-------+------+-------+
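So far I have tried assigning an ID to each row and grouping by that ID, but I can't get any further without using a cursor.
If you are using MySQL version 8+, you can use ROW_NUMBER() to get the desired result: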
SELECT MIN(`FROM`) START,
       MAX(`UNTIL`) END,
       `GROUP`
FROM (
    SELECT A.*,
           ROW_NUMBER() OVER(ORDER BY `FROM`) RN_FROM,
           ROW_NUMBER() OVER(PARTITION BY `GROUP` ORDER BY `UNTIL`) RN_UNTIL
    FROM Table_lag A
) X
GROUP BY `GROUP`, (RN_FROM - RN_UNTIL)
ORDER BY START;
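To see why the difference of the two row numbers works as a grouping key, here is a minimal Python sketch of the same computation on the sample data. It is not MySQL, just an illustration; the function name merge_by_row_number_gap is made up for this example, and the sketch assumes the rows are distinct.

```python
from collections import defaultdict

# Sample rows from the question: (group, from, until).
rows = [("X", 1, 3), ("X", 2, 4), ("Y", 5, 7),
        ("X", 8, 10), ("Y", 11, 12), ("Y", 12, 13)]

def merge_by_row_number_gap(rows):
    """Hypothetical helper mirroring the RN_FROM - RN_UNTIL grouping key.

    Assumes rows are distinct, because rows are used as dictionary keys.
    """
    # RN_FROM: position of each row when all rows are ordered by From.
    by_from = sorted(rows, key=lambda r: r[1])
    rn_from = {r: i for i, r in enumerate(by_from, start=1)}

    # RN_UNTIL: position within the row's own group, ordered by Until.
    per_group = defaultdict(list)
    for r in rows:
        per_group[r[0]].append(r)
    rn_until = {}
    for members in per_group.values():
        for i, r in enumerate(sorted(members, key=lambda r: r[2]), start=1):
            rn_until[r] = i

    # Rows sharing (group, RN_FROM - RN_UNTIL) are aggregated together.
    islands = defaultdict(list)
    for r in rows:
        islands[(r[0], rn_from[r] - rn_until[r])].append(r)

    merged = [(g, min(r[1] for r in rs), max(r[2] for r in rs))
              for (g, _), rs in islands.items()]
    return sorted(merged, key=lambda m: m[1])

print(merge_by_row_number_gap(rows))
# [('X', 1, 4), ('Y', 5, 7), ('X', 8, 10), ('Y', 11, 13)]
```

One thing the sketch makes visible: the key only tracks runs of consecutive rows that belong to the same group, not whether the ranges actually overlap. For example, merge_by_row_number_gap([("X", 1, 3), ("X", 5, 7)]) returns [("X", 1, 7)], so this approach fits data where every gap inside a group is separated by a row from another group, as in the sample.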
SELECT `Group`, `From`, `Until`
FROM ( SELECT `Group`, `From`, ROW_NUMBER() OVER (PARTITION BY `Group` ORDER BY `From`) rn
       FROM test t1
       WHERE NOT EXISTS ( SELECT NULL
                          FROM test t2
                          WHERE t1.`From` > t2.`From`
                            AND t1.`From` <= t2.`Until`
                            AND t1.`Group` = t2.`Group` ) ) t3
JOIN ( SELECT `Group`, `Until`, ROW_NUMBER() OVER (PARTITION BY `Group` ORDER BY `From`) rn
       FROM test t1
       WHERE NOT EXISTS ( SELECT NULL
                          FROM test t2
                          WHERE t1.`Until` >= t2.`From`
                            AND t1.`Until` < t2.`Until`
                            AND t1.`Group` = t2.`Group` ) ) t4 USING (`Group`, rn)
This should work for any kind of overlap (partial overlap, adjacent, fully contained).
It will not work if From and/or Until is NULL.
Could you add an explanation in English? – ysth
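The first subquery finds the starts of the joined ranges (see the fiddle, where it is executed separately): within each group, it looks for From values that do not fall in the middle or at the end of any other range (equal start points are allowed).
The second subquery does the same for the Until values of the joined ranges.
Both also number the values they find.
The outer query simply joins the start and end of each range into one row.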
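The three steps just described (find the starts, find the ends, pair them by position) can be sketched in Python roughly as follows. This is only an illustration of the logic, not the query itself: merge_by_start_end_pairing is a made-up name, and unlike the SQL the sketch deduplicates the start and end values with a set before numbering them.

```python
# Sample rows from the question: (group, from, until).
rows = [("X", 1, 3), ("X", 2, 4), ("Y", 5, 7),
        ("X", 8, 10), ("Y", 11, 12), ("Y", 12, 13)]

def merge_by_start_end_pairing(rows):
    """Hypothetical helper mirroring the two NOT EXISTS subqueries."""
    merged = []
    for g in sorted({r[0] for r in rows}):
        grp = [r for r in rows if r[0] == g]

        # Starts: From values that are not strictly after another row's
        # From while still inside that row's range (equal starts allowed).
        starts = sorted({f for (_, f, u) in grp
                         if not any(f2 < f <= u2 for (_, f2, u2) in grp)})

        # Ends: Until values that are not inside another row's range
        # strictly before that row's Until.
        ends = sorted({u for (_, f, u) in grp
                       if not any(f2 <= u < u2 for (_, f2, u2) in grp)})

        # Pair the n-th start with the n-th end, like USING (`Group`, rn).
        merged.extend((g, s, e) for s, e in zip(starts, ends))
    return sorted(merged, key=lambda m: m[1])

print(merge_by_start_end_pairing(rows))
# [('X', 1, 4), ('Y', 5, 7), ('X', 8, 10), ('Y', 11, 13)]
```

Because each candidate is checked against every other range in the group rather than only its neighbour, a fully contained range such as ("X", 2, 3) inside ("X", 1, 10) contributes neither a start nor an end.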
I would use a recursive CTE for this:
with recursive intervals (`Group`, `From`, `Until`) as (
    select distinct t1.Group, t1.From, t1.Until
    from Table_lag t1
    where not exists (
        select 1
        from Table_lag t2
        where t1.Group=t2.Group
        and t1.From between t2.From and t2.Until+1
        and (t1.From,t1.Until) <> (t2.From,t2.Until)
    )
    union all
    select t1.Group, t1.From, t2.Until
    from intervals t1
    join Table_lag t2
        on t2.Group=t1.Group
        and t2.From between t1.From and t1.Until+1
        and t2.Until > t1.Until
)
select `Group`, `From`, max(`Until`) as Until
from intervals
group by `Group`, `From`
order by `From`, `Group`;
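The anchor expression (select .. where not exists (...)) finds every Group and From that will not be combined with some earlier From (so it yields one row for each row of the final output).
The recursive query then adds rows for the merged intervals of each of those rows.
Finally, grouping by Group and From (those are terrible column names) gives the largest interval for each starting Group/From.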
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=9efa508504b80e44b73c952572394b76
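For readers who find the recursion easier to follow outside SQL, here is a rough Python sketch of the anchor step, the recursive step, and the final max per Group/From. It mirrors the conditions of the query as written, including the +1 that lets integer-adjacent ranges merge; merge_recursive is a made-up name for this example.

```python
# Sample rows from the question: (group, from, until).
rows = [("X", 1, 3), ("X", 2, 4), ("Y", 5, 7),
        ("X", 8, 10), ("Y", 11, 12), ("Y", 12, 13)]

def merge_recursive(rows):
    """Hypothetical helper mirroring the recursive CTE."""
    distinct = set(rows)

    # Anchor: rows whose From is not inside [From, Until + 1] of any
    # other row of the same group.
    frontier = {
        (g, f, u) for (g, f, u) in distinct
        if not any(g2 == g and f2 <= f <= u2 + 1 and (f2, u2) != (f, u)
                   for (g2, f2, u2) in distinct)
    }

    best = {}  # (group, from) -> largest until reached so far
    while frontier:
        extended = set()
        for (g, f, u) in frontier:
            best[(g, f)] = max(best.get((g, f), u), u)
            # Recursive step: extend with rows that start inside the
            # current interval (or right after it) and end later.
            for (g2, f2, u2) in distinct:
                if g2 == g and f <= f2 <= u + 1 and u2 > u:
                    extended.add((g, f, u2))
        frontier = extended  # until strictly grows, so this terminates

    return sorted(((g, f, u) for (g, f), u in best.items()),
                  key=lambda m: (m[1], m[0]))

print(merge_recursive(rows))
# [('X', 1, 4), ('Y', 5, 7), ('X', 8, 10), ('Y', 11, 13)]
```

With the +1 in place, merge_recursive([("X", 1, 3), ("X", 4, 6)]) gives [("X", 1, 6)]; dropping the +1 in both conditions would keep those two ranges separate.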
Alternatively, you can do this with a simple set of joins and subqueries, with no CTE or window functions needed:
select
    interval_start_range.grp,
    interval_start_range.start,
    max(merged.finish) finish
from (
    select
        interval_start.grp,
        interval_start.start,
        min(later_interval_start.start) next_start
    from (
        select distinct t1.grp, t1.start, t1.finish
        from Table_lag t1
        where not exists (
            select 1
            from Table_lag t2
            where t1.grp=t2.grp
            and t1.start between t2.start and t2.finish+1
            and (t1.start,t1.finish) <> (t2.start,t2.finish)
        )
    ) interval_start
    left join (
        select distinct t1.grp, t1.start, t1.finish
        from Table_lag t1
        where not exists (
            select 1
            from Table_lag t2
            where t1.grp=t2.grp
            and t1.start between t2.start and t2.finish+1
            and (t1.start,t1.finish) <> (t2.start,t2.finish)
        )
    ) later_interval_start
        on interval_start.grp=later_interval_start.grp
        and interval_start.start < later_interval_start.start
    group by interval_start.grp, interval_start.start
) as interval_start_range
join Table_lag merged
    on merged.grp=interval_start_range.grp
    and merged.start >= interval_start_range.start
    and (interval_start_range.next_start is null or merged.start < interval_start_range.next_start)
group by interval_start_range.grp, interval_start_range.start
order by interval_start_range.start, interval_start_range.grp
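(I've renamed the columns here so they don't need backticks.)
There is one select that gets the starts of all the intervals we will report, joined to another similar select (you could use a CTE to avoid the redundancy) that finds the start of the following reportable interval for the same group, if there is one. That is wrapped in a subquery to return the group, the start value, and the start value of the following reportable interval. After that it only needs to join all the other records that start within that range and pick the maximum finish value.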
https://dbfiddle.uk/?rdbms=mysql_5.5&fiddle=151cc933489c299f7beefa99e1959549
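The join-based version boils down to "find each reportable start, find the next reportable start in the same group, take the max finish of everything in between". A small Python sketch of that idea, with merge_by_next_start as a made-up name and the same (grp, start, finish) column order as the query:

```python
# Sample rows from the question: (grp, start, finish).
rows = [("X", 1, 3), ("X", 2, 4), ("Y", 5, 7),
        ("X", 8, 10), ("Y", 11, 12), ("Y", 12, 13)]

def merge_by_next_start(rows):
    """Hypothetical helper mirroring the join/subquery query."""
    distinct = set(rows)

    # interval_start: starts not covered by [start, finish + 1] of any
    # other row of the same group.
    starts = sorted({
        (g, s) for (g, s, f) in distinct
        if not any(g2 == g and s2 <= s <= f2 + 1 and (s2, f2) != (s, f)
                   for (g2, s2, f2) in distinct)
    })

    merged = []
    for (g, s) in starts:
        # next_start: the following reportable start in the same group.
        later = [s2 for (g2, s2) in starts if g2 == g and s2 > s]
        next_start = min(later) if later else None

        # max finish over every row starting in [start, next_start).
        finish = max(
            f for (g2, s2, f) in rows
            if g2 == g and s2 >= s
            and (next_start is None or s2 < next_start)
        )
        merged.append((g, s, finish))
    return sorted(merged, key=lambda m: (m[1], m[0]))

print(merge_by_next_start(rows))
# [('X', 1, 4), ('Y', 5, 7), ('X', 8, 10), ('Y', 11, 13)]
```

Since the finish is a max over all rows in the window rather than a comparison with the previous row only, a short range nested inside a long one does not cut the merged interval short.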
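You can do this with window functions only, using a gaps-and-islands technique.
The idea is to use lag() and a window sum() to build groups of consecutive records that have the same group and overlapping ranges. You can then aggregate these groups: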
select grp, min(c_from) c_from, max(c_until) c_until
from (
    select
        t.*,
        sum(lag_c_until < c_from) over(partition by grp order by c_from) mygrp
    from (
        select
            t.*,
            lag(c_until, 1, c_until) over(partition by grp order by c_from) lag_c_until
        from mytable t
    ) t
) t
group by grp, mygrp
The column names you chose conflict with SQL keywords (group, from), so I renamed them to grp, c_from and c_until.
Demo on DB Fiddle, with credits to ysth for creating the fiddle in the first place:
grp | c_from | c_until
:-- | -----: | ------:
X | 1 | 4
Y | 5 | 7
X | 8 | 10
Y | 11 | 13
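The lag() plus running sum() idea can also be traced step by step in Python. The sketch below follows the query as written: each row is compared with the Until of the row immediately before it in the same group, and a counter goes up whenever there is a gap. The name merge_gaps_and_islands is made up for this example, and it assumes c_from <= c_until on every row.

```python
from itertools import groupby

# Sample rows from the question: (grp, c_from, c_until).
rows = [("X", 1, 3), ("X", 2, 4), ("Y", 5, 7),
        ("X", 8, 10), ("Y", 11, 12), ("Y", 12, 13)]

def merge_gaps_and_islands(rows):
    """Hypothetical helper mirroring lag() and the window sum()."""
    merged = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for grp, members in groupby(ordered, key=lambda r: r[0]):
        island = 0         # running sum of "gap before this row" flags
        prev_until = None  # what lag(c_until) would see
        spans = {}         # island number -> (min c_from, max c_until)
        for (_, c_from, c_until) in members:
            # lag(c_until, 1, c_until): default to the row's own value.
            lag_c_until = c_until if prev_until is None else prev_until
            if lag_c_until < c_from:
                island += 1  # gap found, so a new island starts here
            lo, hi = spans.get(island, (c_from, c_until))
            spans[island] = (min(lo, c_from), max(hi, c_until))
            prev_until = c_until
        merged.extend((grp, lo, hi) for lo, hi in spans.values())
    return sorted(merged, key=lambda m: m[1])

print(merge_gaps_and_islands(rows))
# [('X', 1, 4), ('Y', 5, 7), ('X', 8, 10), ('Y', 11, 13)]
```

A caveat that shows up when tracing it: because only the immediately preceding row is consulted, a short range nested inside a longer one can start a new island too early. For example, merge_gaps_and_islands([("X", 1, 10), ("X", 2, 3), ("X", 4, 5)]) returns [("X", 1, 10), ("X", 4, 5)]. Comparing against a running maximum of the earlier c_until values instead of the plain lag would likely be needed for data like that.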