SQL query to find all timestamps covered by an interval in A but not covered by an interval in B ("subtract" or "except" between multiple intervals)
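I have multiple tables in a PostgreSQL 9.4 database, where each row contains an interval as two columns, "start" (inclusive) and "stop" (exclusive).
Consider the following pseudo-code (the real tables are more complicated).
CREATE TABLE left (
    start TIMESTAMP,
    stop TIMESTAMP,
    [...]
);
CREATE TABLE right (
    start TIMESTAMP,
    stop TIMESTAMP,
    [...]
);
The intervals include the start but not the stop.
I now need a query that finds all possible intervals of time where a row in "left" covers the interval but no row in "right" covers it at the same time.
One interval in "left" can be cut into any number of intervals in the result, be shortened, or be absent entirely. Consider the following diagram, with time progressing from left to right: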
left      [-----row 1------------------)   [--row 2--)    [--row 3----)
right   [--row1--)    [--row2--)    [--row3--)
result           [----)        [----)        [-------)    [-----------)
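In this tiny example, "left" has three rows representing three intervals, and "right" has three rows representing three other intervals.
The result has four rows of intervals, which together cover all possible timestamps where a row/interval in "left" covers the timestamp but no row/interval in "right" covers the same timestamp.
In reality the tables are of course much larger than three rows each. In fact, I will frequently want to run the algorithm between two subqueries that have the "start" and "stop" columns.
I have hit a dead end (multiple dead ends, in fact) and am on the verge of just fetching all records into memory and applying some procedural programming to the problem...
Any solution or suggestion of what thinking to apply is greatly appreciated.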
Change the column types to tsrange (or create appropriate views):
CREATE TABLE leftr (
    duration tsrange
);
CREATE TABLE rightr (
    duration tsrange
);
insert into leftr values
    ('[2015-01-03, 2015-01-20)'),
    ('[2015-01-25, 2015-02-01)'),
    ('[2015-02-08, 2015-02-15)');
insert into rightr values
    ('[2015-01-01, 2015-01-06)'),
    ('[2015-01-10, 2015-01-15)'),
    ('[2015-01-18, 2015-01-26)');
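For the "create appropriate views" route, a minimal sketch might look like this. The table names tleft and tright and the view names leftr_view and rightr_view are hypothetical (the question's pseudo-code names "left" and "right" are reserved words and would need quoting); substitute the views for leftr and rightr in the query.

```sql
-- Hypothetical source tables with the question's two-column layout.
CREATE TABLE tleft  (start timestamp, stop timestamp);
CREATE TABLE tright (start timestamp, stop timestamp);

-- tsrange(lower, upper) builds a '[)' range by default, which matches
-- "start inclusive, stop exclusive". It raises an error if start > stop.
CREATE VIEW leftr_view  AS SELECT tsrange(start, stop) AS duration FROM tleft;
CREATE VIEW rightr_view AS SELECT tsrange(start, stop) AS duration FROM tright;
```

The same tsrange(start, stop) expression can also be written directly into a subquery when the inputs are subqueries rather than tables.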
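The query:
select duration * gap result
from (
    -- gap between the end of one right row and the start of the next;
    -- the gap after the last row is unbounded
    select tsrange(upper(duration), lower(lead(duration) over (order by duration))) gap
    from rightr
    ) inv
join leftr
on duration && gap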
result
-----------------------------------------------
["2015-01-06 00:00:00","2015-01-10 00:00:00")
["2015-01-15 00:00:00","2015-01-18 00:00:00")
["2015-01-26 00:00:00","2015-02-01 00:00:00")
["2015-02-08 00:00:00","2015-02-15 00:00:00")
(4 rows)
The idea:
l           [-----row 1------------------)   [--row 2--)    [--row 3----)
r         [--row1--)    [--row2--)    [--row3--)
inv(r)             [----)        [----)        [------------------------->
l*inv(r)           [----)        [----)        [-------)    [-----------)
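Note that inv(r) as drawn starts only after the first "right" row, so a part of "left" that begins before any "right" interval would be lost. A possible variant, sketched here under the assumption that the "right" intervals do not overlap each other, adds an unbounded leading gap. The table names demo_left and demo_right are hypothetical and exist only to make the sketch self-contained.

```sql
CREATE TEMP TABLE demo_left  (duration tsrange);
CREATE TEMP TABLE demo_right (duration tsrange);
INSERT INTO demo_left  VALUES ('[2015-01-01, 2015-01-10)');
INSERT INTO demo_right VALUES ('[2015-01-03, 2015-01-05)');

SELECT l.duration * inv.gap AS result
FROM (
    -- leading gap: everything before the first right interval
    -- (a NULL bound means "unbounded" in a range constructor)
    SELECT tsrange(NULL, min(lower(duration))) AS gap
    FROM demo_right
    UNION ALL
    -- gaps between consecutive right intervals, plus the unbounded trailing gap
    SELECT tsrange(upper(duration), lower(lead(duration) OVER (ORDER BY duration)))
    FROM demo_right
) inv
JOIN demo_left l ON l.duration && inv.gap
ORDER BY result;
```

With this data the query returns two pieces, 2015-01-01 to 2015-01-03 and 2015-01-05 to 2015-01-10; without the leading gap only the second piece would come back.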
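If changing the type to tsrange is not an option, here is an alternative solution using window functions.
The important idea is to realize that only the start and end points of the intervals are relevant. In the first step, the intervals are transformed into a sequence of start and end timestamps. (I use numbers to simplify the example.)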
insert into t_left
select 1,4 from dual union all
select 6,9 from dual union all
select 12,13 from dual
;
insert into t_right
select 2,3 from dual union all
select 5,7 from dual union all
select 8,10 from dual union all
select 11,14 from dual
;
with event as (
    select i_start tst, 1 left_change, 0 right_change from t_left union all
    select i_stop tst, -1 left_change, 0 right_change from t_left union all
    select i_start tst, 0 left_change, 1 right_change from t_right union all
    select i_stop tst, 0 left_change, -1 right_change from t_right
)
select tst, left_change, right_change,
    sum(left_change) over (order by tst) as is_left,
    sum(right_change) over (order by tst) as is_right,
    '['||tst||','||lead(tst) over (order by tst) ||')' intrvl
from event
order by tst;
This ends up with two records for each interval, one for the start (+1) and one for the end (-1 in the CHANGE column).
TST LEFT_CHANGE RIGHT_CHANGE IS_LEFT IS_RIGHT INTRVL
  1           1            0       1        0 [1,2)
  2           0            1       1        1 [2,3)
  3           0           -1       1        0 [3,4)
  4          -1            0       0        0 [4,5)
  5           0            1       0        1 [5,6)
  6           1            0       1        1 [6,7)
  7           0           -1       1        0 [7,8)
  8           0            1       1        1 [8,9)
  9          -1            0       0        1 [9,10)
 10           0           -1       0        0 [10,11)
 11           0            1       0        1 [11,12)
 12           1            0       1        1 [12,13)
 13          -1            0       0        1 [13,14)
 14           0           -1       0        0 [14,)
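The window SUM function
sum(left_change) over (order by tst)
adds up all changes so far, yielding 1 while inside an interval and 0 while outside it.
The filter to get all (sub)intervals that are covered by left only is therefore trivial:
is_left = 1 and is_right = 0
The (sub)interval starts with the timestamp of the current row and ends with the timestamp of the next row.
Final notes:
- You may need to add logic to ignore intervals of length 0
- I tested this in Oracle, so please re-check the Postgres functionality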
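As a starting point for that re-check, here is a sketch of how the same approach could be written for PostgreSQL, with the filter and the length-0 check folded in. The table definitions are assumptions (the answer does not show them); the column names i_start and i_stop are taken from the query. The filter uses is_left > 0 rather than = 1 so that overlapping left intervals are tolerated.

```sql
-- Assumed table layout; integers stand in for timestamps.
CREATE TEMP TABLE t_left  (i_start int, i_stop int);
CREATE TEMP TABLE t_right (i_start int, i_stop int);
INSERT INTO t_left  VALUES (1, 4), (6, 9), (12, 13);
INSERT INTO t_right VALUES (2, 3), (5, 7), (8, 10), (11, 14);

WITH event AS (
    -- one +1 event per interval start, one -1 event per interval stop
    SELECT i_start AS tst, 1 AS left_change, 0 AS right_change FROM t_left
    UNION ALL SELECT i_stop,  -1,  0 FROM t_left
    UNION ALL SELECT i_start,  0,  1 FROM t_right
    UNION ALL SELECT i_stop,   0, -1 FROM t_right
), sweep AS (
    SELECT tst,
           lead(tst) OVER (ORDER BY tst) AS next_tst,
           -- running counts; rows with equal tst are peers and get the same sums
           sum(left_change)  OVER (ORDER BY tst) AS is_left,
           sum(right_change) OVER (ORDER BY tst) AS is_right
    FROM event
)
SELECT tst AS i_start, next_tst AS i_stop
FROM sweep
WHERE is_left > 0
  AND is_right = 0
  AND next_tst > tst   -- drops length-0 pieces and the open-ended last row
ORDER BY tst;
```

For the sample data this yields (1,2), (3,4) and (7,8), the rows of the table above where IS_LEFT = 1 and IS_RIGHT = 0. Adjacent result pieces are not merged.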
For completeness: the naive method, without using interval types.
[I used the same sample data as @klin]
CREATE TABLE tleft (
    start TIMESTAMP,
    stop TIMESTAMP,
    payload text
);
INSERT INTO tleft(start,stop) VALUES
    -- ('2015-01-08', '2015-03-07'), ('2015-03-21', '2015-04-14'), ('2015-05-01', '2015-05-15') ;
    ('2015-01-03', '2015-01-20'), ('2015-01-25', '2015-02-01'), ('2015-02-08', '2015-02-15');
CREATE TABLE tright (
    start TIMESTAMP,
    stop TIMESTAMP,
    payload text
);
INSERT INTO tright(start,stop) VALUES
    -- ('2015-01-01', '2015-01-15'), ('2015-02-01', '2015-02-14'), ('2015-03-01', '2015-04-07') ;
    ('2015-01-01', '2015-01-06'), ('2015-01-10', '2015-01-15'), ('2015-01-18', '2015-01-26');
-- Combine all {start,stop} events into one time series
-- , encoding the event-type into a state change.
-- Note: this assumes non-overlapping intervals in both
-- left and right tables.
WITH zzz AS (
    SELECT stamp, SUM(state) AS state
    FROM (
        SELECT 1 AS state, start AS stamp FROM tleft
        UNION ALL
        SELECT -1 AS state, stop AS stamp FROM tleft
        UNION ALL
        SELECT 2 AS state, start AS stamp FROM tright
        UNION ALL
        SELECT -2 AS state, stop AS stamp FROM tright
        ) zz
    GROUP BY stamp
    )
-- Reconstruct *all* (sub)intervals
-- , and calculate a "running sum" over the state variable
SELECT * FROM (
    SELECT zzz.stamp AS zstart
        , LEAD(zzz.stamp) OVER (www) AS zstop
        , zzz.state
        , row_number() OVER(www) AS rn
        , SUM(state) OVER(www) AS sstate
    FROM zzz
    WINDOW www AS (ORDER BY stamp)
    ) sub
-- extract only the (starting) state we are interested in
WHERE sub.sstate = 1
ORDER BY sub.zstart
    ;
Result:
DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 3
CREATE TABLE
INSERT 0 3
       zstart        |        zstop        | state | rn | sstate
---------------------+---------------------+-------+----+--------
 2015-01-06 00:00:00 | 2015-01-10 00:00:00 |    -2 |  3 |      1
 2015-01-15 00:00:00 | 2015-01-18 00:00:00 |    -2 |  5 |      1
 2015-01-26 00:00:00 | 2015-02-01 00:00:00 |    -2 |  9 |      1
 2015-02-08 00:00:00 | 2015-02-15 00:00:00 |     1 | 11 |      1
(4 rows)
If tsrange is not an option, maybe a stored procedure is?
Something like this:
--create tables
drop table if exists tdate1;
drop table if exists tdate2;
create table tdate1(start timestamp, stop timestamp);
create table tdate2(start timestamp, stop timestamp);
--populate tables
insert into tdate1(start, stop) values('2015-01-01 00:10', '2015-01-01 01:00');
insert into tdate2(start, stop) values('2015-01-01 00:00', '2015-01-01 00:20');
insert into tdate2(start, stop) values('2015-01-01 00:30', '2015-01-01 00:40');
insert into tdate2(start, stop) values('2015-01-01 00:50', '2015-01-01 01:20');
insert into tdate1(start, stop) values('2015-01-01 01:10', '2015-01-01 02:00');
insert into tdate1(start, stop) values('2015-01-01 02:10', '2015-01-01 03:00');
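--stored procedure itself
create or replace function tdate_periods(out start timestamp, out stop timestamp)
returns setof record as
$$
declare
    rec record;
    laststart timestamp = null;  -- start of the tdate1 row currently being processed
    startdt timestamp = null;    -- start of the not yet emitted remainder of that row
    stopdt timestamp = null;     -- stop of that row
begin
    for rec in
        select
            t1.start as t1start,
            t1.stop as t1stop,
            t2.start as t2start,
            t2.stop as t2stop
        from tdate1 t1
        -- join only the tdate2 rows that actually overlap the tdate1 row
        left join tdate2 t2 on t2.stop > t1.start and t2.start < t1.stop
        -- the loop relies on this order; it also assumes distinct start values in tdate1
        order by t1.start, t2.start
    loop
        if laststart <> rec.t1start or laststart is null then
            -- new tdate1 row: emit whatever is left of the previous one
            if laststart is not null then
                if startdt < stopdt then
                    start = startdt;
                    stop = stopdt;
                    return next;
                    startdt = stopdt;
                end if;
            end if;
            startdt = rec.t1start;
            stopdt = rec.t1stop;
            laststart = startdt;
        end if;
        if rec.t2start is not null then
            -- emit the uncovered piece in front of this tdate2 row
            if startdt < rec.t2start then
                start = startdt;
                stop = rec.t2start;
                return next;
            end if;
            -- continue after the tdate2 row; greatest() keeps the position
            -- from moving backwards when tdate2 rows overlap each other
            startdt = greatest(startdt, rec.t2stop);
        end if;
    end loop;
    -- emit whatever is left of the last tdate1 row
    if startdt is not null and startdt < stopdt then
        start = startdt;
        stop = stopdt;
        return next;
    end if;
end
$$ language plpgsql;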
--call
select * from tdate_periods();