SQL: join tables with valid from / valid to fields

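I have a very common problem in a PostgreSQL database. Many tables contain entries that are only valid for a period of time, for example contract details that may evolve over time.

To handle this, two fields, valid_from and valid_to, give the validity period of the row's content. Every time the contract changes, a row is added to the table with the current information and the corresponding validity dates.

The main problem arises when joining tables whose validity periods overlap. More precisely, given a first table: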

fg     valid_from    valid_to    attr_table1
key1   2020-01-01   2020-01-18        A
key1   2020-01-18   null              B
key2   2020-01-01   2020-01-30        A
key2   2020-01-30   null              B

and a second table:

fg     valid_from    valid_to    attr_table2
key1   2020-01-01   2020-01-10       1.0
key1   2020-01-10   null             3.0
key2   2020-01-01   2020-01-30      10.0
key2   2020-01-30   null            11.0

I would like to build a joined table whose validity fields combine the validity periods of both tables, for example:

fg     valid_from   valid_to    attr_table1  attr_table2
key1   2020-01-01   2020-01-10        A         1.0
key1   2020-01-10   2020-01-18        A         3.0
key1   2020-01-18   null              B         3.0
key2   2020-01-01   2020-01-30        A         10.0
key2   2020-01-30   null              B         11.0

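My most convincing attempt so far was to switch to the PostgreSQL-specific type daterange and use the && operator ("overlaps", i.e. the ranges have points in common). I merged the valid_from and valid_to fields into a single validity field, and the following query seems to do the job: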

select t1.fg,
       (case when upper(t1.validity) is null
             then case when (upper(t2.validity) is null) 
                       then case when lower(t1.validity) > lower(t2.validity) 
                                 then daterange(lower(t1.validity), null)
                                 else daterange(lower(t2.validity), null)
                                 end
                       else case when lower(t1.validity) > lower(t2.validity) 
                                 then daterange(lower(t1.validity), upper(t2.validity)) 
                                 else daterange(lower(t2.validity), upper(t2.validity)) 
                                 end
                       end
             when upper(t2.validity) is null
             then case when (upper(t1.validity) is null) 
                       then case when lower(t1.validity) > lower(t2.validity) 
                                 then daterange(lower(t1.validity), null)
                                 else daterange(lower(t2.validity), null)
                                 end
                       else case when lower(t1.validity) > lower(t2.validity) 
                                 then daterange(lower(t1.validity), upper(t1.validity)) 
                                 else daterange(lower(t2.validity), upper(t1.validity)) 
                                 end
                       end
             when lower(t1.validity) <= lower(t2.validity)
             then case when upper(t1.validity) >= upper(t2.validity) 
                       then daterange(lower(t2.validity), upper(t2.validity))
                       else daterange(lower(t2.validity), upper(t1.validity))
                       end
             else case when upper(t1.validity) >= upper(t2.validity) 
                       then daterange(lower(t1.validity), upper(t2.validity))
                       else daterange(lower(t1.validity), upper(t1.validity))
                       end
             end
            ) as validity,
       t1.attr_table1, 
       t2.attr_table2
  from table1 as t1 
       join table2 as t2
         on t1.fg = t2.fg
        and t1.validity && t2.validity
order by fg, validity
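
For reference, the nested CASE above appears to compute just the intersection of the two ranges (the later lower bound and the earlier upper bound, with a null upper bound meaning "unbounded"). Assuming the lower bounds are never null, PostgreSQL's range intersection operator * should return the same range for any pair of overlapping rows, so a shorter sketch of the same join could look like this:

```sql
-- Sketch only: same join as above, with the CASE expression replaced by
-- the range intersection operator (*). Assumes table1/table2 as defined
-- in this question, with a daterange column named validity.
select t1.fg,
       t1.validity * t2.validity as validity,
       t1.attr_table1,
       t2.attr_table2
  from table1 as t1
       join table2 as t2
         on t1.fg = t2.fg
        and t1.validity && t2.validity
order by fg, validity;
```

Being an inner join on overlapping ranges, it keeps the same limitation as the longer version: periods covered by only one of the two tables are dropped.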

However, this query fails when a period in the first table starts at a point that matches no entry in the second. For example, add one row to each table:

In table 1:

key1 2019-12-25 2020-01-01 A

In table 2:

key1 2019-12-27 2020-01-01 -1

The first row of the output table is

key1 2019-12-27 2020-01-01 A -1

instead of

key1   2019-12-25   2019-12-27        A    null
key1   2019-12-27   2020-01-01        A    -1 

Does anyone know a better way?

Edit: code to create table1 and table2, using dateranges:

create table table1
( 
  fg text, 
  validity daterange, 
  attr_table1 text
);
insert into table1 
values
('key1', daterange('2020-01-01', '2020-01-18'),  'A'),
('key1', daterange('2020-01-18', null        ),  'B'),
('key2', daterange('2020-01-01', '2020-01-30'),  'A'),
('key2', daterange('2020-01-30', null        ),  'B');

create table table2
( 
  fg text, 
  validity daterange,  
  attr_table2 text
);
insert into table2 
values
('key1', daterange('2020-01-01', '2020-01-10'),   1.0),
('key1', daterange('2020-01-10', null        ),   3.0),
('key2', daterange('2020-01-01', '2020-01-30'),  10.0),
('key2', daterange('2020-01-30', null        ),  11.0);

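Rushing to my next meeting, so I'll write the explanation later. For now...

  • Relies heavily on there being no gaps between rows
  • Also relies on attrib1 and attrib2 never being null (a null would be replaced by the previous non-null value)

Giving...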

with
  combined AS
(
  select fg, lower(validity) AS valid_from, attr_table1, NULL as attr_table2 from table1
  union all
  select fg, lower(validity) AS valid_from, NULL AS attr_table1, attr_table2 from table2
),
  aggregated AS
(
  select
    fg,
    valid_from,
    max(attr_table1)  as attr_table1,
    max(attr_table2)  as attr_table2,
    count(max(attr_table1)) over (partition by fg order by valid_from) attrib1_grp,
    count(max(attr_table2)) over (partition by fg order by valid_from) attrib2_grp
  from
    combined
  group by
    fg,
    valid_from
)
SELECT
  fg,
  valid_from,
  lead(valid_from) over (partition by fg order by valid_from)  as valid_to,
  max(attr_table1) over (partition by fg, attrib1_grp)         as attr_table1,
  max(attr_table2) over (partition by fg, attrib2_grp)         as attr_table2
from
  aggregated
order by
  fg,
  valid_from

Demo: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=7d97c9623e5f9efb4d729775ff61e7b5


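Edit:

The code above relies on the premise that if a key's attribute changes in either table, the result set needs a change on that date too.

This means we can union the two tables keeping only valid_from, and use LEAD() to find valid_to (sometimes the next valid_from comes from the same table, sometimes from the other table).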

This leaves a bunch of NULLs in the attribute columns. If attrib1 changes, attrib2 will be NULL in the unioned set, and vice versa.
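
As a rough illustration of these two steps (union the start dates, then derive valid_to with LEAD()), here is a self-contained sketch. The inline VALUES rows are just a copy of the key1 rows from the question's create script, and the CTE name merged is made up for this example:

```sql
-- Illustrative data only: key1 start dates from table1 (attr_table1 set)
-- and from table2 (attr_table2 set), already unioned into one row set.
with combined(fg, valid_from, attr_table1, attr_table2) as (
  values ('key1', date '2020-01-01', 'A',  null),
         ('key1', date '2020-01-18', 'B',  null),
         ('key1', date '2020-01-01', null, '1.0'),
         ('key1', date '2020-01-10', null, '3.0')
),
merged as (
  -- one row per (fg, valid_from); max() picks the non-null value, if any
  select fg, valid_from,
         max(attr_table1) as attr_table1,
         max(attr_table2) as attr_table2
  from combined
  group by fg, valid_from
)
select fg, valid_from,
       -- the next start date for the same key becomes this row's valid_to
       lead(valid_from) over (partition by fg order by valid_from) as valid_to,
       attr_table1, attr_table2
from merged
order by fg, valid_from;
```

This should return three rows for key1. The row starting 2020-01-10 has a NULL attr_table1 and the row starting 2020-01-18 has a NULL attr_table2, which are exactly the holes that still need filling.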

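What's needed to fill those NULLs is to look back through the new time series and find the most recent NOT NULL value of that attribute. Since LAST_VALUE() has no SKIP NULLS option, I rolled my own...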

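  • Keep a running count of how many times the attribute has been NOT NULL, and use it as a group identifier
  • By definition, the first row in each group has a NOT NULL attribute and all the other rows in it are NULL
  • So taking MAX(attribute) over the group gives every NULL row the preceding NOT NULL value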
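
The three bullets above can be tried in isolation on a toy series. This is a minimal sketch with made-up names (series, ts, val, grp, filled), not the query from this answer:

```sql
-- Toy series: val is NULL wherever the attribute did not change.
with series(ts, val) as (
  values (1, 'A'), (2, null), (3, null), (4, 'B'), (5, null)
),
grouped as (
  select ts, val,
         -- count() ignores NULLs, so the running count only moves
         -- forward on rows where val is NOT NULL
         count(val) over (order by ts) as grp
  from series
)
select ts, val,
       -- each group holds exactly one NOT NULL value, so max() returns it
       max(val) over (partition by grp) as filled
from grouped
order by ts;
-- filled: A, A, A, B, B
```

Rows 1 to 3 land in group 1 and rows 4 to 5 in group 2, so every NULL picks up the last value seen before it.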

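This demo makes it a little easier to see the calculations taking place...


Edit:

I think this now works both for gaps (attribute implicitly set to NULL) and for rows whose attribute is explicitly set to NULL...

  • Assumes no two rows (same key, same table) start on the same date
    • If that happens, the attribute takes the MAX() value for that date
  • Assumes no row can start before the previous row (same key, same table) ends
    • If that happens, garbage is returned

(Though I'd recommend more rigorous testing...)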
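
One way to check both assumptions before running the query is to look for overlapping validity ranges within a single table. The following is a hypothetical sanity check, not part of the query in this answer. It assumes the table1 definition from the question and would need to be repeated for table2:

```sql
-- Sort each key's rows by start date and compare every row with the one
-- before it. Any returned row starts on or before the point where the
-- previous row for the same key ends, i.e. it violates the assumptions.
select fg, prev_validity, validity
from (
  select fg,
         validity,
         lag(validity) over (partition by fg
                             order by lower(validity), upper(validity)) as prev_validity
  from table1
) as ordered
where validity && prev_validity;
```

An empty result means no two rows for the same key overlap. Comparing only neighbouring rows is enough to detect whether any overlap exists, because an overlap between two distant rows implies an overlap between some pair of neighbours in this ordering.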

with
  combined(
    fg, valid_from, attr_table1, attr_table2, atrib1_set, atrib2_set
  ) AS
(
  select fg, lower(validity), attr_table1, NULL       , 1, NULL::int from table1
  union all
  select fg, upper(validity), NULL,        NULL       , 1, NULL::int from table1
  union all
  select fg, lower(validity), NULL       , attr_table2, NULL, 1 from table2
  union all
  select fg, upper(validity), NULL       , NULL       , NULL, 1 from table2
),
  aggregated AS
(
  select
    fg,
    valid_from,
    max(attr_table1)  as attr_table1,
    max(attr_table2)  as attr_table2,
    count(max(atrib1_set)) over (partition by fg order by valid_from) attrib1_grp,
    count(max(atrib2_set)) over (partition by fg order by valid_from) attrib2_grp
  from
    combined
  where
    valid_from is not null
  group by
    fg,
    valid_from
)
SELECT
  fg,
  valid_from,
  lead(valid_from) over (partition by fg order by valid_from)  as valid_to,
  max(attr_table1) over (partition by fg, attrib1_grp)         as attr_table1,
  max(attr_table2) over (partition by fg, attrib2_grp)         as attr_table2
from
  aggregated
order by
  fg,
  valid_from

Demo:

[UPDATE]

  • Make a calendar table of all existing time spans for each fg
  • LEFT JOIN table1 and table2 to this table
  • [for ease of comparison, I changed the NULL valid_to values to infinity]

create table table1
(
  fg text,
  validity daterange,
  attr_table1 text
);

insert into table1
values
('key1', daterange('2019-12-25', '2020-01-01'),  'A'), -- NEW
('key1', daterange('2020-01-01', '2020-01-18'),  'A'),
('key1', daterange('2020-01-19', 'infinity'        ),  'B'),
('key2', daterange('2020-01-01', '2020-01-30'),  'A'),
('key2', daterange('2020-01-30', 'infinity'        ),  'B');

create table table2
(
  fg text,
  validity daterange,
  attr_table2 text
);
insert into table2
values
('key1', daterange('2019-12-27', '2020-01-01'),  -1  ), -- NEW
('key1', daterange('2020-01-01', '2020-01-10'),   1.0),
('key1', daterange('2020-01-10', 'infinity'        ),   3.0),
('key2', daterange('2020-01-01', '2020-01-30'),  10.0),
('key2', daterange('2020-01-30', 'infinity'        ),  11.0);


        -- Make a 'CALENDAR' table with all points in time (per fg)
        -- ---------------------------------------------------------
WITH pits AS (
        select distinct fg, lower(validity) as pit FROM table1
UNION
        select distinct fg, upper(validity) as pit FROM table1
UNION
        select distinct fg, lower(validity) as pit FROM table2
UNION
        select distinct fg, upper(validity) as pit FROM table2
        )
        -- combine all adjacent PITs to ranges
        -- ---------------------------------------
, pairs AS (
        SELECT fg, pit AS opit
        , lead(pit) OVER (PARTITION BY fg ORDER BY pit) AS npit
        from pits
        )
        -- Make dateranges from them
        -- --------------------------
, tablex AS (
        SELECT fg
        , daterange(opit,npit) AS validity
        FROM pairs
        WHERE npit IS NOT NULL
        -- ORDER BY 1,2;
        )
        -- Left join both table1 and table2 to all ranges (tablex)
        -- ----------------------------------------------
SELECT tx.fg
        , tx.validity
        , t1.validity * t2.validity AS overlapped
        , t1.attr_table1
        , t2.attr_table2
FROM tablex tx
LEFT JOIN table1 t1 ON t1.fg = tx.fg AND t1.validity && tx.validity
LEFT JOIN table2 t2 ON t2.fg = tx.fg AND t2.validity && tx.validity
ORDER BY 1,2
        ;

Result:


DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 5
CREATE TABLE
INSERT 0 5
  fg  |        validity         |       overlapped        | attr_table1 | attr_table2 
------+-------------------------+-------------------------+-------------+-------------
 key1 | [2019-12-25,2019-12-27) |                         | A           | 
 key1 | [2019-12-27,2020-01-01) | [2019-12-27,2020-01-01) | A           | -1
 key1 | [2020-01-01,2020-01-10) | [2020-01-01,2020-01-10) | A           | 1.0
 key1 | [2020-01-10,2020-01-18) | [2020-01-10,2020-01-18) | A           | 3.0
 key1 | [2020-01-18,2020-01-19) |                         |             | 3.0
 key1 | [2020-01-19,infinity)   | [2020-01-19,infinity)   | B           | 3.0
 key2 | [2020-01-01,2020-01-30) | [2020-01-01,2020-01-30) | A           | 10.0
 key2 | [2020-01-30,infinity)   | [2020-01-30,infinity)   | B           | 11.0
(8 rows)