SQL group by selecting top rows with possible nulls

Question

Example table:

id	name	create_time	group_id
1	a	2022-01-01 12:00:00	group1
2	b	2022-01-01 13:00:00	group1
3	c	2022-01-01 12:00:00	NULL
4	d	2022-01-01 13:00:00	NULL
5	e	NULL	group2
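
To try the queries in this thread, the example table can be recreated with a script along these lines. This is only a sketch: the table name mytable (borrowed from Answer 3) and the column types are assumptions, since the question does not state them.

```sql
-- Assumed schema: the question gives neither a table name nor column types.
-- create_time is declared with an explicit NULL so that older MySQL versions
-- do not silently turn the TIMESTAMP column into NOT NULL with an automatic default.
create table mytable (
  id          integer      not null primary key,
  name        varchar(50),
  create_time timestamp    null,
  group_id    varchar(50)
);

-- The five sample rows from the table above.
insert into mytable (id, name, create_time, group_id) values
  (1, 'a', '2022-01-01 12:00:00', 'group1'),
  (2, 'b', '2022-01-01 13:00:00', 'group1'),
  (3, 'c', '2022-01-01 12:00:00', null),
  (4, 'd', '2022-01-01 13:00:00', null),
  (5, 'e', null, 'group2');
```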

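I need the top 1 row (with the minimal create_time) per group_id, under these conditions:

create_time can be null - it should be treated as the minimal value
group_id can be null - all rows with a null group_id should be returned (if that's not possible, we can use coalesce(group_id, id) or something similar, assuming that ids are unique and never collide with group ids)
it should be possible to apply pagination to the query (so a join can be a problem)
the query should be as universal as possible (so nothing vendor-specific). Again, if that's not possible, it should work in MySQL 5&8, PostgreSQL 9+ and H2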

Expected output for the example:

id	name	create_time	group_id
1	a	2022-01-01 12:00:00	group1
3	c	2022-01-01 12:00:00	NULL
4	d	2022-01-01 13:00:00	NULL
5	e	NULL	group2

I have already read similar questions on SO, but 90% of the answers use vendor-specific keywords (many of them rely on PARTITION BY), and others don't honor null values in the group condition columns and probably not pagination either.

Answer 1

I guess:

SELECT id, name, MAX(create_time), group_id
FROM tb GROUP BY group_id 
UNION ALL
SELECT id, name, create_time, group_id
FROM tb WHERE group_id IS NULL
ORDER BY name

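I should point out that 'name' is a keyword in some DBMSs (although it is non-reserved in MySQL and PostgreSQL, so it can be used unquoted there).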

Answer 2

select * from T t1
where coalesce(create_time, 0) = (
    select min(coalesce(create_time, 0)) from T t2
    where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
)

I'm not sure how you envision "pagination" working. Here's one way:

and (
    select count(distinct coalesce(t2.group_id, t2.id)) from T t2
    where coalesce(t2.group_id, t2.id) <= coalesce(t1.group_id, t1.id)
) between 2 and 5 /* for example */
order by coalesce(t1.group_id, t1.id)

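I'm assuming there is an implicit cast from 0 to a date value, and that the resulting value is lower than every value in your database. I'm not sure that's reliable. (Try '19000101' instead?) Otherwise the rest should be generic. You could also parameterize that value in the same way as the page range.

You may also run into a complication with collisions between the group_id and id value spaces. Yours don't appear to have that problem, although mixing data types creates issues of its own.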

This all gets more difficult when you want to order by some other column, such as name:

select * from T t1
where coalesce(create_time, 0) = (
    select min(coalesce(create_time, 0)) from T t2
    where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
) and (
    select count(*) from (
        select * from T t1
        where coalesce(create_time, 0) = (
            select min(coalesce(create_time, 0)) from T t2
            where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
        )
    ) t3
    where t3.name < t1.name or t3.name = t1.name
        and coalesce(t3.group_id, t3.id) <= coalesce(t1.group_id, t1.id)
) between 2 and 5
order by t1.name;

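This does handle ties, but it also makes the simplifying assumption that name can't be null, which would add yet another small twist. At least you can see that it's possible without CTEs and window functions, but expect these queries to be a lot less efficient to run.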

https://dbfiddle.uk/?rdbms=mysql_5.5&fiddle=9697fd274e73f4fa7c1a3a48d2c78691

Answer 3

You can combine two queries with UNION ALL. For example:

select id, name, create_time, group_id
from mytable
where group_id is not null
and not exists
(
  select null
  from mytable older
  where older.group_id = mytable.group_id
  and older.create_time < mytable.create_time  
)
union all
select id, name, create_time, group_id
from mytable
where group_id is null
order by id;

This is standard SQL and very basic. It should work in just about every RDBMS.
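
One caveat: as written, the comparison older.create_time < mytable.create_time is never true when either side is NULL, so a row with a NULL create_time does not displace the other rows of its group, and rows tied on create_time are all returned. If NULL should count as the minimum and exactly one row per group is wanted, the NOT EXISTS condition could be extended roughly as follows. This is a sketch, not part of the original answer; breaking ties by the smaller id is an assumption.

```sql
select id, name, create_time, group_id
from mytable
where group_id is not null
and not exists
(
  select null
  from mytable older
  where older.group_id = mytable.group_id
  and
  (
    -- a NULL create_time counts as the minimum
    (older.create_time is null and mytable.create_time is not null)
    -- regular case: both values are present
    or older.create_time < mytable.create_time
    -- ties (equal values, or both NULL): assume the smaller id wins
    or (older.id < mytable.id
        and (older.create_time = mytable.create_time
             or (older.create_time is null and mytable.create_time is null)))
  )
)
union all
select id, name, create_time, group_id
from mytable
where group_id is null
order by id;
```

On the sample data from the question this returns the rows with ids 1, 3, 4 and 5.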

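As to paging: this is usually costly, because you run the same query again and again in order to pick the "next" part of the result each time, instead of running the query only once. The best approach is usually to use the primary key to get to the next part, so that an index on the key can be used. In the query above we would ideally add where id > :last_biggest_id to both queries and limit the result, which in standard SQL would be fetch next <n> rows only. Every time we run the query, we use the last ID read as :last_biggest_id, so we continue reading from there.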
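
Applied to the query above, that could look roughly like this. It is a sketch only: :last_biggest_id has to be written in whatever bind-variable syntax your DBMS or driver uses, the page size of 10 is a made-up value, and MySQL would need LIMIT 10 in place of the fetch clause.

```sql
-- :last_biggest_id is the bind variable described above (use 0 for the first page);
-- 10 is a hypothetical page size
select id, name, create_time, group_id
from mytable
where group_id is not null
and id > :last_biggest_id
and not exists
(
  select null
  from mytable older
  where older.group_id = mytable.group_id
  and older.create_time < mytable.create_time
)
union all
select id, name, create_time, group_id
from mytable
where group_id is null
and id > :last_biggest_id
order by id
fetch next 10 rows only;
```

The order by and fetch clauses apply to the combined result of the union, so each call returns the next rows in id order.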

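Variables, however, are handled differently in the various DBMSs; most commonly they are preceded by a colon, a dollar sign or an at sign. The standard fetch clause, too, is supported by only some DBMSs, while others have a LIMIT or TOP clause instead.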

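If these small differences make it impossible to apply them, then you must find a workaround. For the variable, this can be a one-row table holding the largest ID read so far. For the fetch clause, this can mean that you simply fetch as many rows as you need and stop there. This isn't ideal, of course, because the DBMS doesn't know that you only need the next n rows and cannot optimize the execution plan accordingly.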
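
The one-row table workaround might be sketched like this. The table name paging_state is hypothetical, and the first-row-per-group filter is left out to keep the example short. Note that a single shared row only works if one reader pages through the result at a time.

```sql
-- hypothetical one-row helper table holding the last ID that was read
create table paging_state (last_biggest_id integer not null);
insert into paging_state (last_biggest_id) values (0);

-- read the next part: compare against the stored value instead of a bind variable,
-- then stop fetching in the application once enough rows have been read
select id, name, create_time, group_id
from mytable
where id > (select last_biggest_id from paging_state)
order by id;

-- after reading up to, say, id 4, remember that position for the next call
update paging_state set last_biggest_id = 4;
```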

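Then there is the option of not paging in the DBMS at all, but reading the complete result into your application and handling the paging there (which then becomes purely a display matter and, of course, uses a lot of memory).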
