Temporal joins in a hierarchical query

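I would like to join the nodes of a tree while making sure that every returned root-to-leaf path is temporally valid. The tricky part is that the data source is date-effective: each node ID can have several versions, each with its own validity period.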

ID  NVALUE  VFROM       VTO
1   A       2021-01-01  2021-01-31
1   B       2021-02-01  2021-02-28
2   C       2021-01-01  2021-02-28
3   D       2021-01-01  2021-01-31
3   E       2021-02-01  2021-02-28

The links simply point to node IDs (but not to their dates!):

LINK_CHILD  LINK_PARENT
1           2
2           3

From this I would like to return the valid paths together with their validity periods:

  A-C-D  2021-01-01  2021-01-31  (valid)
  B-C-E  2021-02-01  2021-02-28  (valid)

Invalid paths (for example A-C-E) should not be returned, because there is no moment in time at which all three nodes are valid.

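The problem I am running into is that the "overlaps" check is not transitive: A overlapping B and B overlapping C does not imply that A overlaps C. So when I write the connect by query, each level overlaps the next one, yet the resulting path as a whole can still be invalid.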
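To make the non-transitivity concrete, here is a minimal Python sketch (not part of the original question) using the A, C, D and E rows from the table above. The helper names `overlaps` and `common_window` are made up for illustration, and the date ranges are assumed to be inclusive on both ends.

```python
from datetime import date

def overlaps(a, b):
    """True if the inclusive date ranges a and b share at least one day."""
    return max(a[0], b[0]) <= min(a[1], b[1])

def common_window(*ranges):
    """Intersection of all ranges, or None if no single day is covered by all of them."""
    start = max(r[0] for r in ranges)
    end = min(r[1] for r in ranges)
    return (start, end) if start <= end else None

# (vfrom, vto) for the node versions in the question's sample data
A = (date(2021, 1, 1), date(2021, 1, 31))
C = (date(2021, 1, 1), date(2021, 2, 28))
D = (date(2021, 1, 1), date(2021, 1, 31))
E = (date(2021, 2, 1), date(2021, 2, 28))

print(overlaps(A, C), overlaps(C, E))  # True True: each adjacent pair overlaps
print(overlaps(A, E))                  # False: the two ends never coexist
print(common_window(A, C, E))          # None: A-C-E is not a valid path
print(common_window(A, C, D))          # the January window shared by A, C and D
```

A level-by-level check only ever sees adjacent pairs, which is why a path needs the running intersection of all its ranges rather than a chain of pairwise tests.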

The basic query I have set up is:

with src_nodes (id, nvalue, vfrom, vto) as (
    select 1, 'A', date '2021-01-01', date '2021-01-31' from dual union all
    select 1, 'B', date '2021-02-01', date '2021-02-28' from dual union all
    select 2, 'C', date '2021-01-01', date '2021-02-28' from dual union all
    select 3, 'D', date '2021-01-01', date '2021-01-31' from dual union all
    select 3, 'E', date '2021-02-01', date '2021-02-28'
    from dual
),
     src_links(link_child, link_parent) as (
         select 1, 2 from dual union all
         select 2, 3 from dual
     ),
     full_links as (
         select c.*
         from src_links c
         union
         select null, link_child
         from src_links a
         where not exists(select null from src_links b where b.link_parent = a.link_child)
     ),
     nodes_and_links as (
         select *
         from full_links a
                  join src_nodes n on n.id = a.link_parent)
select *
from nodes_and_links nl
start with nl.link_child is null
connect by prior nl.link_parent = nl.link_child and
           greatest(prior nl.vfrom, nl.vfrom) < 
           least(prior nl.vto, nl.vto)

I've been playing around with this. It's a fun one! What I came up with is to keep all of your CTEs and replace the final SELECT with the following:

-- appended to the existing WITH list, hence the leading comma
, hierarchy AS (
     SELECT
         SYS_CONNECT_BY_PATH(nl.NVALUE,'-')||'-' AS Path,
         nl.*
     FROM nodes_and_links nl
          --remove the following line to get all valid paths, not necessarily beginning "at the top"
          START WITH nl.link_child IS NULL
          CONNECT BY PRIOR nl.link_parent = nl.link_child
    )
    
SELECT
    h1.Path,
    MAX(h2.VFROM) AS VFROM,
    MIN(h2.VTO) AS VTO
FROM
    hierarchy h1
    INNER JOIN hierarchy h2 ON h1.Path like ('%' || h2.Path || '%')
WHERE
  --This where clause ensures you get only cases where there is no further child record to be had.
    NOT EXISTS (SELECT 1 FROM src_links sr WHERE sr.link_child  = h1.id)
GROUP BY
    h1.Path
HAVING
    MAX(h2.VFROM) <= MIN(h2.VTO)

I won't claim there isn't a better way (there most likely is), but this seems to work.

Here is one approach. Its efficiency can probably be improved a bit, but first make sure it works as expected with your real data.

with
  src_nodes (id, nvalue, vfrom, vto) as (
    select 1, 'A', date '2021-01-01', date '2021-01-31' from dual union all
    select 1, 'B', date '2021-02-01', date '2021-02-28' from dual union all
    select 2, 'C', date '2021-01-01', date '2021-02-28' from dual union all
    select 3, 'D', date '2021-01-01', date '2021-01-31' from dual union all
    select 3, 'E', date '2021-02-01', date '2021-02-28' from dual
  )
, src_links (link_child, link_parent) as (
    select 1, 2 from dual union all
    select 2, 3 from dual
  )
, vdates (vfrom, vmax) as (
    select distinct vfrom, max(vto) over ()
    from   src_nodes
  )
, w (vfrom, vto) as (
    select vfrom, nvl(lead(vfrom) over (order by vfrom) - 1, vmax)
    from   vdates
  )
, vlinks (n_child, n_parent, vfrom, vto) as (
    select sn1.nvalue, sn2.nvalue, w.vfrom, w.vto
    from   src_links sl cross join w
           join src_nodes sn1 on sl.link_child  = sn1.id 
                and w.vfrom >= sn1.vfrom and w.vto <= sn1.vto
           join src_nodes sn2 on sl.link_parent = sn2.id
                and w.vfrom >= sn2.vfrom and w.vto <= sn2.vto
  )
select  connect_by_root(n_child) || sys_connect_by_path(n_parent, ' - ') as pth,
        vfrom, vto
from    vlinks
where   connect_by_isleaf = 1
start   with n_child not in (select n_parent from vlinks)
connect by n_child = prior n_parent and prior vfrom = vfrom
;

PTH             VFROM      VTO       
--------------- ---------- ----------
A - C - D       2021-01-01 2021-01-31
B - C - E       2021-02-01 2021-02-28
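As a reading aid, the query above appears to cut the timeline into windows at every distinct VFROM and then keep, per window, only the node versions that cover the whole window, so the hierarchy can be walked one window at a time. The following Python sketch mirrors the `vdates`, `w` and `vlinks` steps on the same sample rows; the function names are hypothetical. Like the query, it cuts only at VFROM values.

```python
from datetime import date, timedelta

# (id, nvalue, vfrom, vto), the same rows as src_nodes
nodes = [
    (1, "A", date(2021, 1, 1), date(2021, 1, 31)),
    (1, "B", date(2021, 2, 1), date(2021, 2, 28)),
    (2, "C", date(2021, 1, 1), date(2021, 2, 28)),
    (3, "D", date(2021, 1, 1), date(2021, 1, 31)),
    (3, "E", date(2021, 2, 1), date(2021, 2, 28)),
]

def elementary_windows(nodes):
    """Cut the timeline at every distinct vfrom; the last window ends at the latest vto."""
    starts = sorted({n[2] for n in nodes})
    last_end = max(n[3] for n in nodes)
    ends = [s - timedelta(days=1) for s in starts[1:]] + [last_end]
    return list(zip(starts, ends))

def versions_in(window, nodes):
    """Map node id -> nvalue for the versions that cover the whole window."""
    lo, hi = window
    return {nid: val for nid, val, vfrom, vto in nodes if vfrom <= lo and hi <= vto}

windows = elementary_windows(nodes)
for w in windows:
    # one line per window: start, end, and the version of each node valid throughout
    print(w[0], w[1], versions_in(w, nodes))
```

Within a single window every node ID has at most one version in this data, so the date problem disappears and an ordinary hierarchical walk is enough.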

I think this is the most efficient approach: one recursive WITH query, plus another simple query to pick out the leaves.

Here is an example with a more complex data source (dbfiddle):

with src_nodes as (
    select 1 id, 'A' nvalue, date '2021-01-01' vfrom, date '2021-02-10' vto
    from dual
    union all
    select 1, 'B', date '2021-02-15', date '2021-02-28'
    from dual
    union all
    select 2, 'C', date '2021-01-01', date '2021-01-31'
    from dual
    union all
    select 2, 'D', date '2021-02-01', date '2021-02-28'
    from dual
    union all
    select 3, 'E', date '2021-01-01', date '2021-02-28'
    from dual
    union all
    select 4, 'F', date '2021-01-01', date '2021-01-31'
    from dual
    union all
    select 4, 'G', date '2021-02-01', date '2021-02-28'
    from dual
    union all
    select 5, 'H', date '2021-02-01', date '2021-02-28'
    from dual
    union all
    select 6, 'I', date '2021-02-10', date '2021-02-28'
    from dual

),
     src_links as (
         select 1 link_child, 2 link_parent
         from dual
         union all
         select 2, 3
         from dual
         union all
         select 3, 4
         from dual
         union all
         select 5, 6
         from dual
     ),
     -- use "recursive with" method instead of "connect by" to be able to
     -- refine the validity dates as we walk the tree
     hier (id, vfrom, vto, nvalue, lvl, root_id, tpath) as (
         select sn.id, sn.vfrom, sn.vto, sn.nvalue, 1 lvl, sn.id, sn.nvalue || ''
         from src_nodes sn
         where -- start with nodes that have no incoming parent link
             exists(select null from src_links a where a.link_child = sn.id)
           and not exists(select null from src_links a where a.link_parent = sn.id)
         union all
         select sn.id,
                greatest(sn.vfrom, hier.vfrom),
                least(sn.vto, hier.vto),
                sn.nvalue,
                hier.lvl + 1 lvl,
                hier.root_id,
                hier.tpath || '-' || sn.nvalue
         from hier
                  join src_links ln on ln.link_child = hier.id
                  join src_nodes sn on sn.id = ln.link_parent --
             and greatest(sn.vfrom, hier.vfrom) < least(sn.vto, hier.vto)
     ) -- use "depth first" to be able to detect leaf nodes
         search depth first by id set seq,
     hier_leaves as (
         select *
         from (
                  select a.*,
                         -- a difference of one means it's a normal 'depth first' step. otherwise it's a leaf
                         (case lead(a.lvl) over (order by a.seq) - a.lvl
                              when 1 then 'inner'
                              else 'leaf' end) path_type
                  from hier a)
         where path_type = 'leaf')
select hl.tpath, hl.vfrom, hl.vto
from hier_leaves hl;
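The core idea of the query above, narrowing the validity window at every step and abandoning a branch as soon as the window becomes empty, can also be sketched outside SQL. This Python version is only an illustration under stated assumptions: `walk` is a hypothetical helper, the data structures are invented for the sketch, and it keeps the same strict `<` comparison as the query, so a window of a single day is rejected.

```python
from datetime import date

# id -> list of (nvalue, vfrom, vto) versions, same rows as src_nodes above
nodes = {
    1: [("A", date(2021, 1, 1), date(2021, 2, 10)),
        ("B", date(2021, 2, 15), date(2021, 2, 28))],
    2: [("C", date(2021, 1, 1), date(2021, 1, 31)),
        ("D", date(2021, 2, 1), date(2021, 2, 28))],
    3: [("E", date(2021, 1, 1), date(2021, 2, 28))],
    4: [("F", date(2021, 1, 1), date(2021, 1, 31)),
        ("G", date(2021, 2, 1), date(2021, 2, 28))],
    5: [("H", date(2021, 2, 1), date(2021, 2, 28))],
    6: [("I", date(2021, 2, 10), date(2021, 2, 28))],
}
links = [(1, 2), (2, 3), (3, 4), (5, 6)]  # (link_child, link_parent)

def walk(node_id, path, vfrom, vto):
    """Yield (path, vfrom, vto) for every path that cannot be extended any further."""
    extended = False
    for link_child, link_parent in links:
        if link_child != node_id:
            continue
        for nvalue, nfrom, nto in nodes[link_parent]:
            lo, hi = max(vfrom, nfrom), min(vto, nto)  # narrowed window
            if lo < hi:  # same strict comparison as the query
                extended = True
                yield from walk(link_parent, path + "-" + nvalue, lo, hi)
    if not extended:
        yield path, vfrom, vto

# start from ids that occur as link_child but never as link_parent
starts = {c for c, _ in links} - {p for _, p in links}
results = [
    leaf
    for start in sorted(starts)
    for nvalue, vfrom, vto in nodes[start]
    for leaf in walk(start, nvalue, vfrom, vto)
]
for tpath, vfrom, vto in results:
    print(tpath, vfrom, vto)
```

Because the window carried down the recursion is already the intersection of everything above it, a single comparison per step is enough and the non-transitivity problem never arises.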

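I have now tested this approach against data with 300K nodes and 240K links, and it resolves the tree (plus some extra pivoting) within 6 seconds. The ETL needed 10 minutes to do similar work.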