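Temporal joins in a hierarchical query
I want to join the nodes of a tree while making sure that each returned root-to-leaf path is temporally valid. The tricky part is that the data source is date-effective: every node version carries its own validity period.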
ID | NVALUE | VFROM | VTO |
---|---|---|---|
1 | A | 2021-01-01 | 2021-01-31 |
1 | B | 2021-02-01 | 2021-02-28 |
2 | C | 2021-01-01 | 2021-02-28 |
3 | D | 2021-01-01 | 2021-01-31 |
3 | E | 2021-02-01 | 2021-02-28 |
The links simply point to node IDs (but not to their dates!):
LINK_CHILD | LINK_PARENT |
---|---|
1 | 2 |
2 | 3 |
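From this I want to return the valid paths together with their validity periods:
A-C-D, valid from 2021-01-01 to 2021-01-31
B-C-E, valid from 2021-02-01 to 2021-02-28
Invalid paths (for example A-C-E) should not be returned, because there is no moment at which all three nodes are valid.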
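The problem I have is that the "overlaps" check is not transitive: for three periods X, Y and Z, X overlapping Y and Y overlapping Z does not imply that X overlaps Z. In the sample data, A overlaps C and C overlaps E, yet A and E never coexist. So when I write a connect by query, each level overlaps with the next one, but the resulting path as a whole is not valid.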
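To make the non-transitivity concrete, here is a minimal sketch against the sample rows for A, C and E. The CTE name iv and the output column names are made up for illustration, and I treat each period as closed on both ends (so <= counts a shared day as an overlap), which is an assumption about the data:

```sql
-- Pairwise overlap versus overlap of the whole path A-C-E.
-- iv, pth and the *_overlap / path_valid columns are illustrative names only.
with iv (nvalue, vfrom, vto) as (
  select 'A', date '2021-01-01', date '2021-01-31' from dual union all
  select 'C', date '2021-01-01', date '2021-02-28' from dual union all
  select 'E', date '2021-02-01', date '2021-02-28' from dual
)
select a.nvalue || '-' || c.nvalue || '-' || e.nvalue as pth,
       -- level-by-level checks, as a connect by condition would do them
       case when greatest(a.vfrom, c.vfrom) <= least(a.vto, c.vto)
            then 'Y' else 'N' end as a_c_overlap,
       case when greatest(c.vfrom, e.vfrom) <= least(c.vto, e.vto)
            then 'Y' else 'N' end as c_e_overlap,
       -- check over all three periods at once
       case when greatest(a.vfrom, c.vfrom, e.vfrom) <= least(a.vto, c.vto, e.vto)
            then 'Y' else 'N' end as path_valid
from iv a
cross join iv c
cross join iv e
where a.nvalue = 'A'
  and c.nvalue = 'C'
  and e.nvalue = 'E';
```

Both pairwise checks come back 'Y' while path_valid is 'N', which is exactly the case that a level-by-level connect by condition lets through.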
The basic query I have set up is:
with src_nodes (id, nvalue, vfrom, vto) as (
select 1, 'A', date '2021-01-01', date '2021-01-31' from dual union all
select 1, 'B', date '2021-02-01', date '2021-02-28' from dual union all
select 2, 'C', date '2021-01-01', date '2021-02-28' from dual union all
select 3, 'D', date '2021-01-01', date '2021-01-31' from dual union all
select 3, 'E', date '2021-02-01', date '2021-02-28'
from dual
),
src_links(link_child, link_parent) as (
select 1, 2 from dual union all
select 2, 3 from dual
),
full_links as (
select c.*
from src_links c
union
select null, link_child
from src_links a
where not exists(select null from src_links b where b.link_parent = a.link_child)
),
nodes_and_links as (
select *
from full_links a
join src_nodes n on n.id = a.link_parent)
select *
from nodes_and_links nl
start with nl.link_child is null
connect by prior nl.link_parent = nl.link_child and
greatest(prior nl.vfrom, nl.vfrom) <
least(prior nl.vto, nl.vto)
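I've been playing around with this. It's a fun one! What I came up with is to keep all of your CTEs and replace the final SELECT with the following (it is appended as one more CTE after nodes_and_links, so it needs a separating comma in front):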
hierarchy AS (
SELECT
SYS_CONNECT_BY_PATH(nl.NVALUE,'-')||'-' AS Path,
nl.*
FROM nodes_and_links nl
--remove the following line to get all valid paths, not necessarily beginning "at the top"
START WITH nl.link_child IS NULL
CONNECT BY PRIOR nl.link_parent = nl.link_child
)
SELECT
h1.Path,
MAX(h2.VFROM) AS VFROM,
MIN(h2.VTO) AS VTO
FROM
hierarchy h1
INNER JOIN hierarchy h2 ON h1.Path like ('%' || h2.Path || '%')
WHERE
--This where clause ensures you get only cases where there is no further child record to be had.
NOT EXISTS (SELECT 1 FROM src_links sr WHERE sr.link_child = h1.id)
GROUP BY
h1.Path
HAVING
MAX(h2.VFROM) <= MIN(h2.VTO)
I won't claim there isn't a better way, there quite likely is, but this seems to work.
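The GROUP BY / HAVING step above relies on the fact that the common validity of several periods runs from the latest start to the earliest end, and that it is empty when the latest start falls after the earliest end. A stripped-down sketch of just that step, with the path membership hard-coded in a hypothetical candidate_paths CTE instead of being derived from the hierarchy:

```sql
-- candidate_paths is a hypothetical stand-in for the h1/h2 join above:
-- one row per (path, node on that path).
with candidate_paths (pth, nvalue, vfrom, vto) as (
  select 'A-C-D', 'A', date '2021-01-01', date '2021-01-31' from dual union all
  select 'A-C-D', 'C', date '2021-01-01', date '2021-02-28' from dual union all
  select 'A-C-D', 'D', date '2021-01-01', date '2021-01-31' from dual union all
  select 'A-C-E', 'A', date '2021-01-01', date '2021-01-31' from dual union all
  select 'A-C-E', 'C', date '2021-01-01', date '2021-02-28' from dual union all
  select 'A-C-E', 'E', date '2021-02-01', date '2021-02-28' from dual
)
select pth,
       max(vfrom) as vfrom,  -- latest start among the nodes on the path
       min(vto)   as vto     -- earliest end among the nodes on the path
from candidate_paths
group by pth
-- keep a path only if its common period is not empty
having max(vfrom) <= min(vto);
```

Only A-C-D survives, valid from 2021-01-01 to 2021-01-31. For A-C-E the latest start (2021-02-01) is after the earliest end (2021-01-31), so the HAVING clause drops it.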
Here is one approach. Its efficiency can probably be improved a bit, but first make sure it works as expected with your real data.
with
src_nodes (id, nvalue, vfrom, vto) as (
select 1, 'A', date '2021-01-01', date '2021-01-31' from dual union all
select 1, 'B', date '2021-02-01', date '2021-02-28' from dual union all
select 2, 'C', date '2021-01-01', date '2021-02-28' from dual union all
select 3, 'D', date '2021-01-01', date '2021-01-31' from dual union all
select 3, 'E', date '2021-02-01', date '2021-02-28' from dual
)
, src_links (link_child, link_parent) as (
select 1, 2 from dual union all
select 2, 3 from dual
)
, vdates (vfrom, vmax) as (
select distinct vfrom, max(vto) over ()
from src_nodes
)
, w (vfrom, vto) as (
select vfrom, nvl(lead(vfrom) over (order by vfrom) - 1, vmax)
from vdates
)
, vlinks (n_child, n_parent, vfrom, vto) as (
select sn1.nvalue, sn2.nvalue, w.vfrom, w.vto
from src_links sl cross join w
join src_nodes sn1 on sl.link_child = sn1.id
and w.vfrom >= sn1.vfrom and w.vto <= sn1.vto
join src_nodes sn2 on sl.link_parent = sn2.id
and w.vfrom >= sn2.vfrom and w.vto <= sn2.vto
)
select connect_by_root(n_child) || sys_connect_by_path(n_parent, ' - ') as pth,
vfrom, vto
from vlinks
where connect_by_isleaf = 1
start with n_child not in (select n_parent from vlinks)
connect by n_child = prior n_parent and prior vfrom = vfrom
;
PTH             VFROM      VTO
--------------- ---------- ----------
A - C - D       2021-01-01 2021-01-31
B - C - E       2021-02-01 2021-02-28
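As far as I can tell, the vdates and w CTEs above cut the timeline into elementary periods, one starting at each distinct vfrom, and the hierarchy is then walked once per period. To look at those slices on their own for the sample data (the CTE definitions are copied from the query above, only the final SELECT is added for illustration):

```sql
with src_nodes (id, nvalue, vfrom, vto) as (
  select 1, 'A', date '2021-01-01', date '2021-01-31' from dual union all
  select 1, 'B', date '2021-02-01', date '2021-02-28' from dual union all
  select 2, 'C', date '2021-01-01', date '2021-02-28' from dual union all
  select 3, 'D', date '2021-01-01', date '2021-01-31' from dual union all
  select 3, 'E', date '2021-02-01', date '2021-02-28' from dual
)
, vdates (vfrom, vmax) as (
  -- every distinct start date, plus the overall latest end date
  select distinct vfrom, max(vto) over ()
  from src_nodes
)
-- each slice ends the day before the next start date;
-- the last slice ends at the overall latest end date
select vfrom,
       nvl(lead(vfrom) over (order by vfrom) - 1, vmax) as vto
from vdates
order by vfrom;
```

For the sample rows this gives two periods, 2021-01-01 to 2021-01-31 and 2021-02-01 to 2021-02-28. Only start dates are used as cut points, so it is worth checking the result on data where a vto is not immediately followed by another vfrom.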
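I think this is the most efficient approach: one recursive with query plus a simple query on top of it to pick out the leaves.
Here is an example with a more complex data source: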
with src_nodes as (
select 1 id, 'A' nvalue, date '2021-01-01' vfrom, date '2021-02-10' vto
from dual
union all
select 1, 'B', date '2021-02-15', date '2021-02-28'
from dual
union all
select 2, 'C', date '2021-01-01', date '2021-01-31'
from dual
union all
select 2, 'D', date '2021-02-01', date '2021-02-28'
from dual
union all
select 3, 'E', date '2021-01-01', date '2021-02-28'
from dual
union all
select 4, 'F', date '2021-01-01', date '2021-01-31'
from dual
union all
select 4, 'G', date '2021-02-01', date '2021-02-28'
from dual
union all
select 5, 'H', date '2021-02-01', date '2021-02-28'
from dual
union all
select 6, 'I', date '2021-02-10', date '2021-02-28'
from dual
),
src_links as (
select 1 link_child, 2 link_parent
from dual
union all
select 2, 3
from dual
union all
select 3, 4
from dual
union all
select 5, 6
from dual
),
-- use "recursive with" method instead of "connect by" to be able to
-- refine the validity dates as we walk the tree
hier (id, vfrom, vto, nvalue, lvl, root_id, tpath) as (
select sn.id, sn.vfrom, sn.vto, sn.nvalue, 1 lvl, sn.id, sn.nvalue || ''
from src_nodes sn
where -- start with nodes that have no incoming parent link
exists(select null from src_links a where a.link_child = sn.id)
and not exists(select null from src_links a where a.link_parent = sn.id)
union all
select sn.id,
greatest(sn.vfrom, hier.vfrom),
least(sn.vto, hier.vto),
sn.nvalue,
hier.lvl + 1 lvl,
hier.root_id,
hier.tpath || '-' || sn.nvalue
from hier
join src_links ln on ln.link_child = hier.id
join src_nodes sn on sn.id = ln.link_parent --
and greatest(sn.vfrom, hier.vfrom) < least(sn.vto, hier.vto)
) -- use "depth first" to be able to detect leaf nodes
search depth first by id set seq,
hier_leaves as (
select *
from (
select a.*,
-- a difference of one means it's a normal 'depth first' step. otherwise it's a leaf
(case lead(a.lvl) over (order by a.seq) - a.lvl
when 1 then 'inner'
else 'leaf' end) path_type
from hier a)
where path_type = 'leaf')
select hl.tpath, hl.vfrom, hl.vto
from hier_leaves hl;
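I have now tested this approach against data with 300K nodes and 240K links, and it resolved the tree in 6 seconds (plus some extra pivoting). The ETL did a similar job in 10 minutes.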
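For comparison, here is a cut-down sketch of the same recursive with idea run against the five sample rows from the question. It is not the query benchmarked above, and it makes three simplifying assumptions of its own: leaves are found with a NOT EXISTS against src_links rather than the LEAD-over-seq check, the overlap test uses <= so that a single shared day counts as valid, and the path column is cast to varchar2(4000) so it can grow during recursion:

```sql
with src_nodes (id, nvalue, vfrom, vto) as (
  select 1, 'A', date '2021-01-01', date '2021-01-31' from dual union all
  select 1, 'B', date '2021-02-01', date '2021-02-28' from dual union all
  select 2, 'C', date '2021-01-01', date '2021-02-28' from dual union all
  select 3, 'D', date '2021-01-01', date '2021-01-31' from dual union all
  select 3, 'E', date '2021-02-01', date '2021-02-28' from dual
),
src_links (link_child, link_parent) as (
  select 1, 2 from dual union all
  select 2, 3 from dual
),
hier (id, tpath, vfrom, vto) as (
  -- anchor: node versions whose id is never the target of a link
  select sn.id, cast(sn.nvalue as varchar2(4000)), sn.vfrom, sn.vto
  from src_nodes sn
  where not exists (select null from src_links l where l.link_parent = sn.id)
  union all
  -- step: follow a link and narrow the period to the running intersection
  select sn.id,
         h.tpath || '-' || sn.nvalue,
         greatest(h.vfrom, sn.vfrom),
         least(h.vto, sn.vto)
  from hier h
  join src_links l on l.link_child = h.id
  join src_nodes sn on sn.id = l.link_parent
   -- assumption: closed periods, so a one-day intersection is still valid
   and greatest(h.vfrom, sn.vfrom) <= least(h.vto, sn.vto)
)
select tpath, vfrom, vto
from hier h
-- assumption: a leaf is a node id with no outgoing link
where not exists (select null from src_links l where l.link_child = h.id)
order by tpath;
```

Because the period carried down the recursion is already the intersection of everything above it, A-C-E and B-C-D are rejected at the last step, and the query returns A-C-D (2021-01-01 to 2021-01-31) and B-C-E (2021-02-01 to 2021-02-28). Unlike the LEAD-based check, this leaf test only reports paths that reach the end of the link chain; paths that stop early for lack of a valid continuation are left out.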