递归 CTE 遍历网络,但根据属性在交叉点选择路径

Recursive CTE to walk up a network, but choosing path at junctions based on an attribute

我正在尝试通过“向上走”直到到达终端节点来追踪河流网络图中的路径。问题是,当我们逆流而上时,我们会遇到支流或交叉路口,我们的路径可能会走向多个方向。我想使用链接的属性指定步行的方向。我正在尝试使用 postgres 构建一个 CTE 来实现这一点,但无法完全实现。

这是一个简单的网络图;到达红色和蓝色节点:

每个范围都有一个 area 属性。该图可以通过

构建
CREATE TEMP TABLE tree (
  id_reach integer PRIMARY KEY,
  from_node integer UNIQUE NOT NULL,
  to_node integer,
  area float
);

INSERT INTO tree(id_reach, from_node, to_node, area) VALUES
 (0, 1, 0, 40),
 (1, 3, 1, 29),
 (2, 2, 1, 5),
 (3, 4, 3, 7),
 (4, 5, 3, 6); 

我想做什么

我想在此图中指定一个节点,然后沿着最大 area 河段的路径向上游遍历。因此,对于所示示例,输出将显示为

 starting node |  path  |
---------------+---------
       0        {0, 1, 3}
       1        {1, 3}
       3        {3}

我不知道如何将 area 信息合并到 CTE 中。到目前为止,我想出的最好的 returns all 到达 starting_node 的上游(下面是 to_node = 1):

WITH RECURSIVE us_path AS (
    SELECT id_reach,
        from_node,
        to_node
    FROM tree
    WHERE to_node = 1
    UNION ALL
    SELECT tr.id_reach,
        tr.from_node,
        tr.to_node
    FROM tree as tr
        JOIN us_path AS usp ON tr.to_node = usp.from_node
)
SELECT id_reach
FROM us_path;

当然是这个returns

1
2
3
4

这些都是 id_reach 的上游。

我怎样才能将 CTE 修改为仅 return 遵循具有最大 areaid_reach 的路径?

我试过的

我试过 post- 处理上面 CTE 的结果,但我找不到将结果过滤到 return 这个路径的方法。我尝试将 WHEREHAVING 语句插入 CTE 以尝试 select 达到最大 area 的范围,但也无法使它们起作用。

更新

下面给出了一些很好的答案,所有这些都让我学到了新东西。已接受的答案对于我的用例来说是最简单和最快的,尽管其他答案提供了一些概括问题的机制。

这里是你如何做到的:

WITH RECURSIVE us_path AS (
    select *
    from (
        SELECT * , row_number() over (order by area desc) maxarea       
        FROM tree
        WHERE to_node = 0
    ) t where maxarea = 1 
    UNION ALL
    select *
    from (select tr.*,row_number() over (order by tr.area desc) maxarea
    FROM tree as tr
    join us_path AS usp on tr.to_node = usp.from_node
    ) t where maxarea = 1     
)
SELECT id_reach, area
FROM us_path;

或者如果您需要它用于所有可能的起始节点:

WITH RECURSIVE us_path AS (
    select *
    from (
        SELECT to_node, from_node, id_reach,area , row_number() over (partition by to_node order by area desc) maxarea       
        FROM tree
    ) t where maxarea = 1 
    UNION ALL
    select * 
    from (select usp.to_node, tr.from_node,tr.id_reach,tr.area,row_number() over (partition by usp.to_node order by tr.area desc) maxarea
    FROM tree as tr
    join us_path AS usp 
        ON tr.to_node = usp.from_node
    ) t where maxarea = 1 
)
SELECT to_node, array_agg(id_reach), sum(area) total_area
FROM us_path
group by to_node
order by to_node

db<>fiddle here

这应该有效


with recursive pt as (
select distinct to_node as start_node,
                null::integer as to_node,
                first_value(from_node) over (partition by to_node order by area desc) as from_node,
                first_value(id_reach) over (partition by to_node order by area desc) as longest_path
    from tree
union all
select distinct pt.start_node as start_node,
                t1.to_node as to_node,
                first_value(t1.from_node) over (partition by t1.to_node order by area desc) as from_node,
                first_value(id_reach) over (partition by t1.to_node order by area desc) as longest_path
    from tree t1
    join pt on pt.from_node = t1.to_node
)
select pt.start_node, array_agg(longest_path) from pt group by pt.start_node

给出结果:

start_node  path
0           {0,1,3}
1           {1,3}
3           {3}

第一个查询给出起点,在starting_point中使用null作为to_node保证不会再次选择该行。

我还确保每一行都包含起点,这样可以很容易地在最后阶段对所有河段进行分组。

如果您不熟悉 first_value 函数,它只会选择根据 over 子句找到的第一个值(每个 to_node 中面积最大的范围)

eshirvana 的回答使用 row_number 每组选择一行。

这是一个更短且可能更简洁的变体。 在递归部分我们只需要一行,所以 LIMIT 似乎是一个自然的选择。很可能它也会快一点。

WITH RECURSIVE us_path AS (

    SELECT *
    FROM tree
    WHERE to_node = 0
        
    UNION ALL
    
    select * 
    from (
        select tr.*
        FROM tree as tr
        join us_path AS usp 
            ON tr.to_node = usp.from_node
        ORDER BY tr.area desc
        LIMIT 1
    ) t
    
)
SELECT id_reach, area
FROM us_path;

方法有很多种。这里有两个优雅而快速的:

1。在 LATERAL 子查询

中获取具有最大 area 的下一个节点
WITH RECURSIVE us_path AS (
   -- start with biggest area for each node
   (  -- parentheses required
   SELECT DISTINCT ON (to_node)
          to_node AS starting_node, from_node, ARRAY[to_node, from_node] AS path
   FROM   tree
   ORDER  BY to_node, area DESC NULLS LAST
   )
   UNION ALL
   SELECT tr.*
   FROM   us_path usp
   CROSS  JOIN LATERAL (
      SELECT usp.starting_node, tr.from_node, usp.path || tr.from_node
      FROM   tree tr
      WHERE  tr.to_node = usp.from_node
      ORDER  BY tr.area DESC NULLS LAST
      LIMIT  1  -- chose biggest area at each junction
      ) tr
   )
SELECT DISTINCT ON (starting_node)
       starting_node, path
FROM   us_path
ORDER  BY starting_node, cardinality(path) DESC;

db<>fiddle here

给定示例的结果:

 starting_node |   path    
---------------+-----------
             0 | {0,1,3,4}
             1 | {1,3,4}
             3 | {3,4}
(3 rows)

在每个交叉点,从一组并列最大的对等点中挑选任意行 area - 根据您的评论:

but in those cases choosing any of the ties would be acceptable.

要使选择具有确定性,请向每个 ORDER BY 子句添加更多表达式。

由于area可以是NULL(未定义NOT NULL),所以按降序添加NULLS LAST,否则NULL排在最前面。参见:

  • Sort NULL values to the end of a table
  • Select first row in each GROUP BY group?

在初始(非递归)项中,从每组具有相同 to_node 的边中,我只选择具有最大 area 的边。这样,永远不会到达具有劣质区域的叶子节点。您想要的结果似乎也一样。

(
SELECT DISTINCT ON (to_node)
       to_node AS starting_node, from_node, ARRAY[to_node, from_node] AS path
FROM   tree
ORDER  BY to_node, area DESC NULLS LAST
)

括在括号中,因为:

  • Combine multiple SELECT statements

在外部 SELECT 我只为每个起始节点选择“最后”行,由数组的长度标识 (cardinality(path))

相关:

  • Optimize GROUP BY query to retrieve latest row per user

2。获取相关子查询中的下一个节点,在外部聚合 SELECT

WITH RECURSIVE us_path AS (
   -- start with biggest area for each node
   (  -- parentheses required
   SELECT DISTINCT ON (to_node)
          to_node AS starting_node, from_node, 1 AS step
   FROM   tree
   ORDER  BY to_node, area DESC NULLS LAST  -- start with biggest area for each node
   )
   UNION ALL
   SELECT usp.starting_node
        ,(SELECT tr.from_node
          FROM   tree tr
          WHERE  tr.to_node = usp.from_node
          ORDER  BY tr.area DESC NULLS LAST
          LIMIT  1)  -- chose biggest area at each junction
        , step + 1
   FROM   us_path usp      
   WHERE  usp.from_node IS NOT NULL
   )
SELECT starting_node, starting_node || array_agg(from_node)
FROM  (
   SELECT *
   FROM   us_path
   WHERE  from_node IS NOT NULL
   ORDER  BY starting_node, step
   ) sub
GROUP  BY 1;

db<>fiddle here

相同的结果。

这次我们需要明确地消除附加的 NULL 值。与上面的 CROSS JOIN 不同,相关子查询不会消除这些。