SQL query to find all timestamps covered by an interval in A but not covered by an interval in B ("subtract" or "except" between multiple intervals)

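I have multiple tables in a PostgreSQL 9.4 database, where each row contains an interval stored as two columns, "start" (inclusive) and "stop" (exclusive).

Consider the following pseudo-code (the real tables are more complicated).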

-- "left" and "right" are reserved words in PostgreSQL, hence the double quotes
CREATE TABLE "left" (
    start TIMESTAMP,
    stop TIMESTAMP,
    [...]
);

CREATE TABLE "right" (
    start TIMESTAMP,
    stop TIMESTAMP,
    [...]
);

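The intervals are inclusive of the start, but exclusive of the stop.

I now need a query to find all possible intervals of time where there is a row in "left" covering the interval, but no row in "right" covering the same interval at the same time.

One interval in "left" can be cut up into any number of intervals in the result, be shortened, or be entirely absent. Consider the following graph, with time progressing from left to right: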

left     [-----row 1------------------)   [--row 2--)    [--row 3----)
right  [--row1--)    [--row2--)    [--row3--)         
result          [----)        [----)        [-------)    [-----------)

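In this tiny example, "left" has three rows, each representing one interval, and "right" has three rows, each representing another interval.

The result has four rows of intervals, which together cover all possible timestamps where there is a row/interval in "left" covering the timestamp, but no row/interval in "right" covering the same timestamp.

In reality the tables are of course much larger than three rows each; in fact, I will frequently want to run the algorithm between two subqueries that expose "start" and "stop" columns.

I have hit a dead end (multiple dead ends, in fact), and am on the verge of just fetching all records into memory and applying some procedural programming to the problem...

Any solutions or suggestions about what thinking to apply are greatly appreciated.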

Change the column types to tsrange (or create appropriate views):

CREATE TABLE leftr (
    duration tsrange
);

CREATE TABLE rightr (
    duration tsrange
);

insert into leftr values
('[2015-01-03, 2015-01-20)'),
('[2015-01-25, 2015-02-01)'),
('[2015-02-08, 2015-02-15)');

insert into rightr values
('[2015-01-01, 2015-01-06)'),
('[2015-01-10, 2015-01-15)'),
('[2015-01-18, 2015-01-26)');

Query:

select duration * gap as result
from (
    -- inv(r): the gap after each interval in rightr; the last gap is open-ended
    select tsrange(upper(duration), lower(lead(duration) over (order by duration))) as gap
    from rightr
    ) inv
join leftr
on duration && gap;

                    result                     
-----------------------------------------------
 ["2015-01-06 00:00:00","2015-01-10 00:00:00")
 ["2015-01-15 00:00:00","2015-01-18 00:00:00")
 ["2015-01-26 00:00:00","2015-02-01 00:00:00")
 ["2015-02-08 00:00:00","2015-02-15 00:00:00")
(4 rows)    

The idea:

l          [-----row 1------------------)   [--row 2--)    [--row 3----)
r        [--row1--)    [--row2--)    [--row3--)
inv(r)            [----)        [----)        [------------------------->
l*inv(r)          [----)        [----)        [-------)    [-----------)
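
As drawn, inv(r) has no gap before the first interval of r, so any part of l that starts earlier than every row of r would be lost. The sketch below is one possible way to add that leading gap; it is a variation on the query, not the original answer, and it still assumes that the rows of rightr do not overlap each other (otherwise tsrange() would be called with a lower bound above its upper bound). The temp tables repeat the sample data so that the block runs on its own.

```sql
create temp table leftr  (duration tsrange);
create temp table rightr (duration tsrange);

insert into leftr values
('[2015-01-03, 2015-01-20)'),
('[2015-01-25, 2015-02-01)'),
('[2015-02-08, 2015-02-15)');

insert into rightr values
('[2015-01-01, 2015-01-06)'),
('[2015-01-10, 2015-01-15)'),
('[2015-01-18, 2015-01-26)');

with inv as (
    -- everything before the first interval in rightr
    -- (a null bound means "unbounded"; with an empty rightr this is the whole timeline)
    select tsrange(null, min(lower(duration))) as gap
    from rightr
    union all
    -- the gap after each interval, open-ended after the last one
    select tsrange(upper(duration),
                   lower(lead(duration) over (order by duration)))
    from rightr
)
select l.duration * inv.gap as result   -- * is range intersection
from leftr l
join inv on l.duration && inv.gap       -- && is "overlaps"
order by result;
```

On the sample data the leading gap ends at 2015-01-01 and overlaps nothing in leftr, so this should return the same four rows as the query above.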

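If changing the type to tsrange is not an option, here is an alternative solution using window functions.

The important idea is to realize that only the start and end points of the intervals are relevant. In the first step, the intervals are transformed into an ordered sequence of start and end timestamps. (I use numbers to simplify the example.)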

 insert into t_left 
 select 1,4 from dual union all
 select 6,9 from dual union all
 select 12,13 from dual    
 ;

 insert into t_right 
 select 2,3 from dual union all
 select 5,7 from dual union all
 select 8,10 from dual union all
 select 11,14 from dual    
 ;

 with event as  (
 select  i_start tst, 1 left_change, 0 right_change from t_left union all
 select  i_stop tst, -1 left_change, 0 right_change from t_left union all
 select  i_start  tst, 0 left_change, 1 right_change from t_right  union all
 select  i_stop tst, 0 left_change, -1 right_change from t_right
 )
 select tst, left_change, right_change,
 sum(left_change) over (order by tst) as is_left,
 sum(right_change) over (order by tst) as is_right,
 '['||tst||','||lead(tst) over (order by tst) ||')' intrvl
 from event
 order by tst;

This ends up with two records for each interval, one for the start (+1) and one for the end (-1 in the CHANGE column).

   TST LEFT_CHANGE RIGHT_CHANGE    IS_LEFT   IS_RIGHT INTRVL         

     1           1            0          1          0 [1,2)     
     2           0            1          1          1 [2,3)     
     3           0           -1          1          0 [3,4)    
     4          -1            0          0          0 [4,5)    
     5           0            1          0          1 [5,6)    
     6           1            0          1          1 [6,7)      
     7           0           -1          1          0 [7,8)    
     8           0            1          1          1 [8,9)     
     9          -1            0          0          1 [9,10)     
    10           0           -1          0          0 [10,11)   
    11           0            1          0          1 [11,12)   
    12           1            0          1          1 [12,13)   
    13          -1            0          0          1 [13,14)   
    14           0           -1          0          0 [14,) 

The window SUM function

 sum(left_change) over (order by tst) 

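adds up all changes so far, yielding 1 while inside an interval and 0 while outside of it.

The filter to get all (sub)intervals that are covered by left only is therefore trivial: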

is_left = 1 and is_right = 0

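Each (sub)interval starts with the timestamp of the current row and ends with the timestamp of the next row.

Final notes:

  • You may need to add logic to ignore intervals of length 0
  • I tested this in Oracle, so please re-check the Postgres functionality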
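
Putting the pieces together, a PostgreSQL version of the whole query might look like the sketch below. The table definitions are an assumption (integer columns i_start and i_stop, inferred from the queries above), "from dual" is dropped because PostgreSQL does not need it, and the filter is_left > 0 is a small generalisation of is_left = 1 so that overlapping left intervals are also accepted.

```sql
-- Assumed table layout; only the column names come from the answer.
create temp table t_left  (i_start int, i_stop int);
create temp table t_right (i_start int, i_stop int);

insert into t_left  values (1, 4), (6, 9), (12, 13);
insert into t_right values (2, 3), (5, 7), (8, 10), (11, 14);

with event as (
    select i_start as tst,  1 as left_change,  0 as right_change from t_left
    union all
    select i_stop,         -1,                 0                 from t_left
    union all
    select i_start,         0,                 1                 from t_right
    union all
    select i_stop,          0,                -1                 from t_right
),
sweep as (
    select tst,
           lead(tst) over (order by tst)         as next_tst,
           sum(left_change)  over (order by tst) as is_left,
           sum(right_change) over (order by tst) as is_right
    from event
)
select tst as i_start, next_tst as i_stop
from sweep
where is_left > 0       -- covered by at least one left interval
  and is_right = 0      -- covered by no right interval
  and tst < next_tst    -- drops zero-length rows and the last row (next_tst is null)
order by tst;
```

For the sample numbers this yields [1,2), [3,4) and [7,8), matching the rows with IS_LEFT = 1 and IS_RIGHT = 0 in the table above. Events that share a timestamp get the same running sums, because the default window frame includes all rows tied on tst, so the zero-length filter is enough to discard the duplicates. Adjacent result pieces are not merged by this sketch.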

For completeness: the naive method, without using interval types. [I used the same sample data as @klin]

CREATE TABLE tleft (
    start TIMESTAMP,
    stop TIMESTAMP,
    payload text
);

INSERT INTO tleft(start,stop) VALUES
-- ('2015-01-08', '2015-03-07'),  ('2015-03-21', '2015-04-14'), ('2015-05-01', '2015-05-15') ;
('2015-01-03', '2015-01-20'), ('2015-01-25', '2015-02-01'), ('2015-02-08', '2015-02-15');

CREATE TABLE tright (
    start TIMESTAMP,
    stop TIMESTAMP,
    payload text
);
INSERT INTO tright(start,stop) VALUES
 -- ('2015-01-01', '2015-01-15'),  ('2015-02-01', '2015-02-14'), ('2015-03-01', '2015-04-07') ;
('2015-01-01', '2015-01-06'), ('2015-01-10', '2015-01-15'), ('2015-01-18', '2015-01-26');

-- Combine all {start,stop} events into one time series,
-- encoding the event type into a state change.
-- Note: this assumes non-overlapping intervals in both
-- left and right tables.
WITH zzz AS (
    SELECT stamp, SUM(state) AS state
    FROM (
        SELECT 1 AS state, start AS stamp FROM tleft
        UNION ALL
        SELECT -1 AS state, stop AS stamp FROM tleft
        UNION ALL
        SELECT 2 AS state, start AS stamp FROM tright
        UNION ALL
        SELECT -2 AS state, stop AS stamp FROM tright
    ) zz
    GROUP BY stamp
)
-- Reconstruct *all* (sub)intervals
-- and calculate a "running sum" over the state variable.
-- sstate: 0 = neither, 1 = left only, 2 = right only, 3 = both
SELECT * FROM (
    SELECT zzz.stamp AS zstart
        , LEAD(zzz.stamp) OVER (www) AS zstop
        , zzz.state
        , row_number() OVER (www) AS rn
        , SUM(state) OVER (www) AS sstate
    FROM zzz
    WINDOW www AS (ORDER BY stamp)
) sub
-- extract only the (starting) state we are interested in
WHERE sub.sstate = 1
ORDER BY sub.zstart
;

Result:

DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 3
CREATE TABLE
INSERT 0 3
       zstart        |        zstop        | state | rn | sstate 
---------------------+---------------------+-------+----+--------
 2015-01-06 00:00:00 | 2015-01-10 00:00:00 |    -2 |  3 |      1
 2015-01-15 00:00:00 | 2015-01-18 00:00:00 |    -2 |  5 |      1
 2015-01-26 00:00:00 | 2015-02-01 00:00:00 |    -2 |  9 |      1
 2015-02-08 00:00:00 | 2015-02-15 00:00:00 |     1 | 11 |      1
(4 rows)

If tsrange is not an option, maybe a stored procedure is? Something like this:

--create tables
drop table if exists tdate1;
drop table if exists tdate2;

create table tdate1(start timestamp, stop timestamp);
create table tdate2(start timestamp, stop timestamp);

--populate tables
insert into tdate1(start, stop) values('2015-01-01 00:10', '2015-01-01 01:00');
insert into tdate2(start, stop) values('2015-01-01 00:00', '2015-01-01 00:20');
insert into tdate2(start, stop) values('2015-01-01 00:30', '2015-01-01 00:40');
insert into tdate2(start, stop) values('2015-01-01 00:50', '2015-01-01 01:20');
insert into tdate1(start, stop) values('2015-01-01 01:10', '2015-01-01 02:00');
insert into tdate1(start, stop) values('2015-01-01 02:10', '2015-01-01 03:00');

--stored procedure itself
create or replace function tdate_periods(out start timestamp, out stop timestamp)
    returns setof record as
$$
declare
    rec record;
    laststart timestamp = null;
    startdt timestamp = null;
    stopdt timestamp = null;
begin
    for rec in
        select
                t1.start as t1start,
                t1.stop as t1stop,
                t2.start as t2start,
                t2.stop as t2stop
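            from tdate1 t1
            -- join only the t2 rows that actually overlap the t1 row
            left join tdate2 t2 on t2.stop > t1.start and t2.start < t1.stop
            -- the loop relies on rows arriving in time order
            order by t1.start, t2.start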
    loop
        if laststart <> rec.t1start or laststart is null then
            if laststart is not null then
                if startdt < stopdt then
                    start = startdt;
                    stop = stopdt;
                    return next;

                    startdt = stopdt;
                end if;
            end if;

            startdt = rec.t1start;
            stopdt = rec.t1stop;

            laststart = startdt;
        end if;

        if rec.t2start is not null then
            if startdt < rec.t2start then
                start = startdt;
                stop = rec.t2start;
                return next;
            end if;

            startdt = greatest(startdt, rec.t2stop);  -- never move backwards if t2 rows overlap
        end if;
    end loop;

    if startdt is not null and startdt < stopdt then
        start = startdt;
        stop = stopdt;
        return next;
    end if;
end
$$ language plpgsql;

--call
select * from tdate_periods();