SQL: join tables with valid from / valid to fields
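I have a very common problem in a PostgreSQL database. Many tables contain entries that are valid only for a certain period of time, for example contract details that may evolve over time.
To handle this, two fields, valid_from and valid_to, record the validity period of each row. Every time the contract changes, a row is added to the table with the new information and the corresponding validity dates.
The main problem arises when joining tables whose validity periods overlap. More precisely, given a first table: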
fg valid_from valid_to attr_table1
key1 2020-01-01 2020-01-18 A
key1 2020-01-19 null B
key2 2020-01-01 2020-01-30 A
key2 2020-01-30 null B
and a second table:
fg valid_from valid_to attr_table2
key1 2020-01-01 2020-01-10 1.0
key1 2020-01-10 null 3.0
key2 2020-01-01 2020-01-30 10.0
key2 2020-01-30 null 11.0
I would like to build a joined table whose validity fields combine the validity periods of both tables, for example:
fg valid_from valid_to attr_table1 attr_table2
key1 2020-01-01 2020-01-10 A 1.0
key1 2020-01-10 2020-01-18 A 3.0
key1 2020-01-18 null B 3.0
key2 2020-01-01 2020-01-30 A 10.0
key2 2020-01-30 null B 11.0
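My most convincing attempt so far was to switch to the PostgreSQL-specific type daterange and use the && operator ("overlaps", i.e. the two ranges have points in common). I merged the valid_from and valid_to fields into a single validity field, and the following query seems to do the job: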
select t1.fg,
       (case when upper(t1.validity) is null
             then case when (upper(t2.validity) is null)
                       then case when lower(t1.validity) > lower(t2.validity)
                                 then daterange(lower(t1.validity), null)
                                 else daterange(lower(t2.validity), null)
                            end
                       else case when lower(t1.validity) > lower(t2.validity)
                                 then daterange(lower(t1.validity), upper(t2.validity))
                                 else daterange(lower(t2.validity), upper(t2.validity))
                            end
                  end
             when upper(t2.validity) is null
             then case when (upper(t1.validity) is null)
                       then case when lower(t1.validity) > lower(t2.validity)
                                 then daterange(lower(t1.validity), null)
                                 else daterange(lower(t2.validity), null)
                            end
                       else case when lower(t1.validity) > lower(t2.validity)
                                 then daterange(lower(t1.validity), upper(t1.validity))
                                 else daterange(lower(t2.validity), upper(t1.validity))
                            end
                  end
             when lower(t1.validity) <= lower(t2.validity)
             then case when upper(t1.validity) >= upper(t2.validity)
                       then daterange(lower(t2.validity), upper(t2.validity))
                       else daterange(lower(t2.validity), upper(t1.validity))
                  end
             else case when upper(t1.validity) >= upper(t2.validity)
                       then daterange(lower(t1.validity), upper(t2.validity))
                       else daterange(lower(t1.validity), upper(t1.validity))
                  end
        end
       ) as validity,
       t1.attr_table1,
       t2.attr_table2
from table1 as t1
join table2 as t2
  on t1.fg = t2.fg
 and t1.validity && t2.validity
order by fg, validity
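However, this query fails when the start of a period in the first table does not match any entry in the second. For example, add one row to each table:
In table 1:
key1 2019-12-25 2020-01-01 A
In table 2:
key1 2019-12-27 2020-01-01 -1
The first row of the output is
key1 2019-12-27 2020-01-01 A -1
instead of
key1 2019-12-25 2019-12-27 A null
key1 2019-12-27 2020-01-01 A -1
Does anyone know a better way to do this?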
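To make the failure concrete: the nested CASE above amounts to intersecting two ranges (the later of the two lower bounds, the earlier of the two upper bounds), which is also what PostgreSQL's range intersection operator * computes. Because the inner join only ever emits such intersections, the days covered by table1 alone never appear. The following is a minimal Python sketch of that logic, not the query itself; it assumes half-open [from, to) ranges with None for an open end, and the function name intersect is made up for illustration.

```python
from datetime import date

def intersect(a, b):
    """Intersect two half-open (start, end) ranges; end=None means unbounded.

    Hypothetical helper mirroring what the nested CASE computes.
    """
    start = max(a[0], b[0])
    ends = [e for e in (a[1], b[1]) if e is not None]
    end = min(ends) if ends else None
    if end is not None and start >= end:
        return None  # the ranges do not overlap
    return (start, end)

# The two extra rows for key1 from the question.
t1_row = (date(2019, 12, 25), date(2020, 1, 1))
t2_row = (date(2019, 12, 27), date(2020, 1, 1))

overlap = intersect(t1_row, t2_row)
print(overlap)  # only the shared part; 2019-12-25..2019-12-27 is lost
```

The segment from 2019-12-25 to 2019-12-27 exists only in table1, so no pair of rows produces it; getting it back needs some form of outer logic rather than a plain inner join.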
Edit: code to create table1 and table2, using dateranges:
create table table1
(
fg text,
validity daterange,
attr_table1 text
);
insert into table1
values
('key1', daterange('2020-01-01', '2020-01-18'), 'A'),
('key1', daterange('2020-01-18', null ), 'B'),
('key2', daterange('2020-01-01', '2020-01-30'), 'A'),
('key2', daterange('2020-01-30', null ), 'B');
and
create table table2
(
fg text,
validity daterange,
attr_table2 text
);
insert into table2
values
('key1', daterange('2020-01-01', '2020-01-10'), 1.0),
('key1', daterange('2020-01-10', null ), 3.0),
('key2', daterange('2020-01-01', '2020-01-30'), 10.0),
('key2', daterange('2020-01-30', null ), 11.0);
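Rushing off to my next meeting, so I'll write an explanation later. For now...
- Relies mainly on there being no gaps between rows
- Also relies on the values of attrib1 and attrib2 never being NULL (a NULL would be replaced by the previous non-NULL value)
Giving...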
with
  combined AS
(
  select fg, lower(validity) AS valid_from, attr_table1, NULL as attr_table2 from table1
  union all
  select fg, lower(validity) AS valid_from, NULL AS attr_table1, attr_table2 from table2
),
  aggregated AS
(
  select
    fg,
    valid_from,
    max(attr_table1) as attr_table1,
    max(attr_table2) as attr_table2,
    count(max(attr_table1)) over (partition by fg order by valid_from) attrib1_grp,
    count(max(attr_table2)) over (partition by fg order by valid_from) attrib2_grp
  from
    combined
  group by
    fg,
    valid_from
)
SELECT
  fg,
  valid_from,
  lead(valid_from) over (partition by fg order by valid_from) as valid_to,
  max(attr_table1) over (partition by fg, attrib1_grp) as attr_table1,
  max(attr_table2) over (partition by fg, attrib2_grp) as attr_table2
from
  aggregated
order by
  fg,
  valid_from
Demo: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=7d97c9623e5f9efb4d729775ff61e7b5
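Edit:
The code above relies on the premise that if a key's attribute changes in either table, the result set needs a change on that date too.
That means we can simply merge the two tables keeping only valid_from, and derive valid_to with LEAD() (sometimes the next valid_from comes from the same table, sometimes from the other one).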
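As a rough illustration of that merge-then-LEAD() step, here is a small Python sketch. It is not the query itself; it assumes the key1 start dates from the question's CREATE/INSERT code, and the variable names are made up for the example.

```python
from datetime import date

# Start dates (valid_from) for key1, taken from the question's insert statements.
table1_starts = [date(2020, 1, 1), date(2020, 1, 18)]
table2_starts = [date(2020, 1, 1), date(2020, 1, 10)]

# Union of the change points from both tables, de-duplicated and ordered
# (the role played by UNION ALL + GROUP BY + ORDER BY in the query).
starts = sorted(set(table1_starts) | set(table2_starts))

# LEAD(valid_from): each period ends where the next one starts;
# the last period has no successor, so its end stays open (None).
periods = list(zip(starts, starts[1:] + [None]))

for valid_from, valid_to in periods:
    print(valid_from, valid_to)
```

The three periods produced match the key1 rows of the desired output in the question.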
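This leaves a bunch of NULLs in the attribute columns. If attrib1 changes, attrib2 will be NULL in the unioned set, and vice versa.
What's needed to fill those NULLs is to look back through the new time series for the most recent NOT NULL value of that attribute. Since LAST_VALUE() has no IGNORE NULLS option in PostgreSQL, I rolled my own...
- Keep a cumulative counter of how many times the attribute has been NOT NULL, and use it as a group identifier
- By definition, the first attribute in each group is NOT NULL and all following rows in the group are NULL
- So taking MAX(attribute) over the group gives every NULL row the previous NOT NULL value
This demo makes it a bit easier to see the calculations taking place...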
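The grouping trick described above can also be followed outside SQL. This is a minimal Python sketch under the same assumption as the first query (a genuine attribute value is never NULL); the sample values are invented for illustration.

```python
# One attribute column after the union, ordered by valid_from.
# None marks rows that exist only because the other table changed.
values = ["A", None, "B", None, None]

# Running count of non-NULL values, like COUNT(x) OVER (ORDER BY valid_from).
# It increases exactly on rows that carry a value, so it works as a group id.
groups = []
seen = 0
for v in values:
    if v is not None:
        seen += 1
    groups.append(seen)

# MAX(x) OVER (PARTITION BY group): each group holds one non-NULL value,
# so the "max" simply spreads that value over the group's NULL rows.
group_value = {g: v for g, v in zip(groups, values) if v is not None}
filled = [group_value.get(g) for g in groups]

print(groups)
print(filled)
```

Rows before the first non-NULL value would land in group 0 and stay empty, which matches what the window functions do when there is nothing to carry forward.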
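Edit:
I think this now works both for gaps (where the attribute is implicitly set to NULL) and for rows where the attribute is explicitly set to NULL...
- Assumes no two rows (same key, same table) start on the same date
  - If that happens, the attribute takes the MAX() value for that date
- Assumes no row starts before the previous row (same key, same table) has ended
  - If that happens, garbage is returned
(Though I'd recommend more rigorous testing...)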
with
  combined(
    fg, valid_from, attr_table1, attr_table2, atrib1_set, atrib2_set
  ) AS
(
  select fg, lower(validity), attr_table1, NULL       , 1,    NULL::int from table1
  union all
  select fg, upper(validity), NULL,        NULL       , 1,    NULL::int from table1
  union all
  select fg, lower(validity), NULL       , attr_table2, NULL, 1         from table2
  union all
  select fg, upper(validity), NULL       , NULL       , NULL, 1         from table2
),
  aggregated AS
(
  select
    fg,
    valid_from,
    max(attr_table1) as attr_table1,
    max(attr_table2) as attr_table2,
    count(max(atrib1_set)) over (partition by fg order by valid_from) attrib1_grp,
    count(max(atrib2_set)) over (partition by fg order by valid_from) attrib2_grp
  from
    combined
  where
    valid_from is not null
  group by
    fg,
    valid_from
)
SELECT
  fg,
  valid_from,
  lead(valid_from) over (partition by fg order by valid_from) as valid_to,
  max(attr_table1) over (partition by fg, attrib1_grp) as attr_table1,
  max(attr_table2) over (partition by fg, attrib2_grp) as attr_table2
from
  aggregated
order by
  fg,
  valid_from
Demo:
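The main difference from the first version is that both bounds of every range become change points, so the end of a row with no successor resets the attribute to NULL. A small Python sketch of that idea for a single table and key follows; it is only an approximation of the query, and it uses the key1 rows from the question's first table listing, which contain a one-day gap.

```python
from datetime import date

# (valid_from, valid_to, value) rows of one table for one key; None = open end.
rows = [
    (date(2020, 1, 1), date(2020, 1, 18), "A"),
    (date(2020, 1, 19), None, "B"),
]

# Map each change point to the value that applies from that date on.
events = {}
for start, end, value in rows:
    events[start] = value             # a start always sets the value
    if end is not None:
        events.setdefault(end, None)  # an end resets it, unless a row starts there

timeline = sorted(events.items())
for point, value in timeline:
    print(point, value)
```

On 2020-01-18 the attribute becomes empty and only on 2020-01-19 does it become B, which is the behaviour the "set" flag columns are meant to achieve in the query.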
[UPDATE]
- Make a calendar table of all existing time spans for each fg
- LEFT JOIN table1 and table2 to this table
- [For ease of comparison, I changed the NULL valid_to values to infinity]
create table table1
(
fg text,
validity daterange,
attr_table1 text
);
insert into table1
values
('key1', daterange('2019-12-25', '2020-01-01'), 'A'), -- NEW
('key1', daterange('2020-01-01', '2020-01-18'), 'A'),
('key1', daterange('2020-01-19', 'infinity' ), 'B'),
('key2', daterange('2020-01-01', '2020-01-30'), 'A'),
('key2', daterange('2020-01-30', 'infinity' ), 'B');
create table table2
(
fg text,
validity daterange,
attr_table2 text
);
insert into table2
values
('key1', daterange('2019-12-27', '2020-01-01'), -1 ), -- NEW
('key1', daterange('2020-01-01', '2020-01-10'), 1.0),
('key1', daterange('2020-01-10', 'infinity' ), 3.0),
('key2', daterange('2020-01-01', '2020-01-30'), 10.0),
('key2', daterange('2020-01-30', 'infinity' ), 11.0);
-- Make a 'CALENDAR' table with all points in time (per fg)
-- ---------------------------------------------------------
WITH pits AS (
select distinct fg, lower(validity) as pit FROM table1
UNION
select distinct fg, upper(validity) as pit FROM table1
UNION
select distinct fg, lower(validity) as pit FROM table2
UNION
select distinct fg, upper(validity) as pit FROM table2
)
-- combine all adjacent PITs to ranges
-- ---------------------------------------
, pairs AS (
SELECT fg, pit AS opit
, lead(pit) OVER (PARTITION BY fg ORDER BY pit) AS npit
from pits
)
-- Make dateranges from them
-- --------------------------
, tablex AS (
SELECT fg
, daterange(opit,npit) AS validity
FROM pairs
WHERE npit IS NOT NULL
-- ORDER BY 1,2;
)
-- Left join both table1 and table2 to all ranges (tablex)
-- ----------------------------------------------
SELECT tx.fg
, tx.validity
, t1.validity * t2.validity AS overlapped
, t1.attr_table1
, t2.attr_table2
FROM tablex tx
LEFT JOIN table1 t1 ON t1.fg = tx.fg AND t1.validity && tx.validity
LEFT JOIN table2 t2 ON t2.fg = tx.fg AND t2.validity && tx.validity
ORDER BY 1,2
;
Result:
DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 5
CREATE TABLE
INSERT 0 5
fg | validity | overlapped | attr_table1 | attr_table2
------+-------------------------+-------------------------+-------------+-------------
key1 | [2019-12-25,2019-12-27) | | A |
key1 | [2019-12-27,2020-01-01) | [2019-12-27,2020-01-01) | A | -1
key1 | [2020-01-01,2020-01-10) | [2020-01-01,2020-01-10) | A | 1.0
key1 | [2020-01-10,2020-01-18) | [2020-01-10,2020-01-18) | A | 3.0
key1 | [2020-01-18,2020-01-19) | | | 3.0
key1 | [2020-01-19,infinity) | [2020-01-19,infinity) | B | 3.0
key2 | [2020-01-01,2020-01-30) | [2020-01-01,2020-01-30) | A | 10.0
key2 | [2020-01-30,infinity) | [2020-01-30,infinity) | B | 11.0
(8 rows)
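For readers who want to trace the calendar approach step by step, here is a hedged Python sketch of the same idea for key1 only. It assumes half-open [from, to) ranges, uses date.max as a stand-in for PostgreSQL's infinity, and the helper name lookup is invented for the example.

```python
from datetime import date

INF = date.max  # stand-in for PostgreSQL's 'infinity' (an assumption of this sketch)

# (valid_from, valid_to, attribute) rows for key1, as in the insert statements above.
table1 = [
    (date(2019, 12, 25), date(2020, 1, 1), "A"),
    (date(2020, 1, 1), date(2020, 1, 18), "A"),
    (date(2020, 1, 19), INF, "B"),
]
table2 = [
    (date(2019, 12, 27), date(2020, 1, 1), "-1"),
    (date(2020, 1, 1), date(2020, 1, 10), "1.0"),
    (date(2020, 1, 10), INF, "3.0"),
]

def lookup(rows, start, end):
    """Return the value of the row overlapping [start, end), else None.

    Hypothetical helper standing in for one LEFT JOIN ... ON validity && validity.
    """
    for lo, hi, value in rows:
        if lo < end and start < hi:
            return value
    return None

# All bounds from both tables, de-duplicated and ordered: the "calendar".
points = sorted({p for lo, hi, _ in table1 + table2 for p in (lo, hi)})

# Adjacent points form the elementary ranges; each is looked up in both tables.
result = [
    (start, end, lookup(table1, start, end), lookup(table2, start, end))
    for start, end in zip(points, points[1:])
]
for row in result:
    print(row)
```

Because every elementary range lies between two adjacent bounds, it can overlap at most one row per table as long as rows within a table do not overlap each other, which is why a simple overlap test is enough here.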