SELECT MAX subquery not allowed in WHERE clause when using WITH RECURSIVE in Postgres

This LeetCode problem gives the following schema:

CREATE TABLE IF NOT EXISTS
  Tasks (task_id int, subtasks_count int);

TRUNCATE TABLE Tasks;

INSERT INTO
  Tasks (task_id, subtasks_count)
VALUES
  ('1', '3'),
  ('2', '2'),
  ('3', '4');


CREATE TABLE IF NOT EXISTS
  Executed (task_id int, subtask_id int);

TRUNCATE TABLE Executed;

INSERT INTO
  Executed (task_id, subtask_id)
VALUES
  ('1', '2'),
  ('3', '1'),
  ('3', '2'),
  ('3', '3'),
  ('3', '4');

With MySQL version 8.0.23, the following is one possible solution:

WITH RECURSIVE possible_tasks_subtasks AS (
  SELECT
    task_id, subtasks_count as max_subtask_count, 1 AS subtask_id
  FROM
    Tasks
  UNION ALL
  SELECT
    task_id, max_subtask_count, subtask_id + 1
  FROM
    possible_tasks_subtasks
  -- The SELECT MAX below is where the problem occurs with Postgres
  WHERE
    subtask_id < (SELECT MAX(max_subtask_count) FROM Tasks))
SELECT
  P.task_id, P.subtask_id
FROM
  possible_tasks_subtasks P
LEFT JOIN
  Executed E ON P.task_id = E.task_id AND P.subtask_id = E.subtask_id
WHERE
  E.task_id IS NULL OR E.subtask_id IS NULL;

When I try the same query with Postgres 13.1, I get the following error:

ERROR: aggregate functions are not allowed in WHERE

This seems odd to me, because the docs on aggregate functions show a seemingly similar solution (as far as using SELECT <aggregate-function> inside a WHERE clause goes):

SELECT city FROM weather WHERE temp_lo = (SELECT max(temp_lo) FROM weather);

If I change

WHERE
  subtask_id < (SELECT MAX(max_subtask_count) FROM Tasks)

in the solution code block above to

WHERE
  subtask_id < (SELECT max_subtask_count FROM Tasks ORDER BY max_subtask_count DESC LIMIT 1)

then Postgres does not throw an error. As a sanity check, I tried

SELECT * FROM tasks WHERE task_id < (SELECT MAX(subtasks_count) FROM Tasks);

just to make sure I could use SELECT MAX in a subquery of the WHERE clause, as the docs suggest, and this works as expected.

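The only thing I can determine so far is that this is somehow related to how Postgres handles things when WITH RECURSIVE is used. But the docs on WITH queries say nothing about using aggregates in a subquery of the WHERE clause.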

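What am I missing here? Why does this work in MySQL but not in Postgres? And more importantly, why does the solution given in the docs seem not to work when WITH RECURSIVE is used (at least from my reading and experimentation)?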

Edit: More context on the LeetCode problem and what it asks the query to do:

Table: Tasks

+----------------+---------+
| Column Name    | Type    |
+----------------+---------+
| task_id        | int     |
| subtasks_count | int     |
+----------------+---------+

Table: Executed

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| task_id       | int     |
| subtask_id    | int     |
+---------------+---------+

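Write an SQL query to report the IDs of the missing subtasks for each task_id. Return the result table in any order. The query result format is shown in the following example: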

Tasks table:
+---------+----------------+
| task_id | subtasks_count |
+---------+----------------+
| 1       | 3              |
| 2       | 2              |
| 3       | 4              |
+---------+----------------+

Executed table:
+---------+------------+
| task_id | subtask_id |
+---------+------------+
| 1       | 2          |
| 3       | 1          |
| 3       | 2          |
| 3       | 3          |
| 3       | 4          |
+---------+------------+

Result table:
+---------+------------+
| task_id | subtask_id |
+---------+------------+
| 1       | 1          |
| 1       | 3          |
| 2       | 1          |
| 2       | 2          |
+---------+------------+

You don't need MAX() to find the subtask count from the Tasks table. Just carry that information over from the initial query into the recursive part.

I would also use a NOT EXISTS condition to get this result:

with recursive all_subtasks as (
  select task_id, 1 as subtask_id, subtasks_count 
  from tasks
  union all
  select t.task_id, p.subtask_id + 1, p.subtasks_count
  from tasks t
    join all_subtasks p on p.task_id = t.task_id
  where p.subtask_id  < p.subtasks_count
)
select st.task_id, st.subtask_id
from all_subtasks st
where not exists (select *
                  from executed e
                  where e.task_id = st.task_id
                    and e.subtask_id = st.subtask_id)
order by st.task_id, st.subtask_id;

In Postgres, this can be written a bit more simply using generate_series():

select t.task_id, st.subtask_id
from tasks t
  cross join generate_series(1, t.subtasks_count) as st(subtask_id)
where not exists (select * 
                  from executed e
                  where e.task_id = t.task_id
                    and e.subtask_id = st.subtask_id)
order by t.task_id;

Online example


As for "why are aggregates not allowed in the recursive part": the answer is simply that nobody on the Postgres development team has implemented it.

Tom Lane's reply:

Reproduced here for ease of reading:

As the query is written, the aggregate is over a field of possible_tasks_subtasks, making it illegal in WHERE, just as the error says. (From the point of view of the SELECT FROM Tasks subquery, it's a constant outer reference, not an aggregate of that subquery. This is per SQL spec.)
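
To make the quoted rule concrete, here is a minimal sketch that reproduces the same behavior without WITH RECURSIVE. The inline tables t(x) and u(y) are hypothetical and exist only for illustration. The level an aggregate belongs to is decided by where its argument column comes from, not by where the aggregate is written.

```sql
-- Hypothetical inline tables t(x) and u(y), used only for illustration.

-- Fails in Postgres with "aggregate functions are not allowed in WHERE":
-- x is a column of the outer query t, so max(x) is treated as an aggregate
-- of the outer query, which puts it in the outer WHERE clause.
SELECT x
FROM (VALUES (1), (2), (3)) AS t(x)
WHERE x < (SELECT max(x) FROM (VALUES (2)) AS u(y));

-- Works: y comes from the subquery's own FROM list, so max(y) is an
-- aggregate of the subquery and is evaluated there.
SELECT x
FROM (VALUES (1), (2), (3)) AS t(x)
WHERE x < (SELECT max(y) FROM (VALUES (2)) AS u(y));
```

In the question's query, max_subtask_count plays the role of x: it is a column of possible_tasks_subtasks, not of Tasks, so the aggregate is attributed to the recursive term rather than to the subquery.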

Following this guidance, the query can be restated successfully in PostgreSQL as follows:

WITH RECURSIVE possible_tasks_subtasks AS (
  SELECT
    task_id, subtasks_count, 1 AS subtask_id
  FROM
    Tasks
  UNION ALL
  SELECT
    task_id, subtasks_count, subtask_id + 1
  FROM
    possible_tasks_subtasks
  WHERE
    subtask_id < (SELECT MAX(subtasks_count) FROM Tasks))
SELECT
  P.task_id, P.subtask_id
FROM
  possible_tasks_subtasks P
LEFT JOIN
  Executed E ON P.task_id = E.task_id AND P.subtask_id = E.subtask_id
WHERE
  E.task_id IS NULL AND P.subtasks_count >= P.subtask_id;