PostgreSQL: get all records with the most recent date

Question

I have this query that runs against a very large data set, but it is too slow.

SELECT *
FROM tableA
WHERE columnA =
        (SELECT MAX(columnA) -- select most recent date from entries we care about
            FROM tableA
            WHERE columnB = '1234' )
    AND columnC in (1,2,3) -- pull a subset out of those entries, this set here can be a thousand (ish) large.

tableA looks like this:

pk	columnA	columnB	columnC
1	5/6/2022	1234	1
2	5/6/2022	1234	2
3	5/5/2022	0000	3
4	5/3/2022	0000	4

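There are about 1,000 distinct values in columnB, and the table has orders of magnitude more rows than that. Is there a better way to structure the query? Or are there columns I could add to the table to make it faster?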

Answer 1

I suspect it is the last line that takes the most time, because the list has to be parsed.

AND columnC in (1,2,3) 
-- pull a subset out of those entries, this set here can be a thousand (ish) large.

It would be better to put these values in a table with an index (PRIMARY KEY), so that the query only has to consult the index.

JOIN tableX AS x
  ON x.id = columnC;
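
To make the fragment above concrete, here is a minimal sketch of the lookup-table approach. It assumes columnC is an integer; tableX and its id column are hypothetical names, and a temporary table is only one way to hold the values.

```sql
-- Hypothetical lookup table holding the columnC values of interest.
-- The PRIMARY KEY creates a unique index on id.
CREATE TEMPORARY TABLE tableX (
    id integer PRIMARY KEY
);

-- Load the (possibly ~1000) values once instead of listing them in IN (...).
INSERT INTO tableX (id) VALUES (1), (2), (3);

-- Refresh planner statistics for the new table.
ANALYZE tableX;

-- Same logic as the original query, with the IN list replaced by a join.
SELECT a.*
FROM tableA AS a
JOIN tableX AS x
  ON x.id = a.columnC
WHERE a.columnA = (SELECT MAX(columnA)
                   FROM tableA
                   WHERE columnB = '1234');
```

Whether this beats the IN list depends on the data, so it is worth comparing both with EXPLAIN ANALYZE.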

We could also create an index on columns A and B.
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=6223777b7cbfa986d1eb852ac08aeaaf
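
As a rough sketch of what such indexes might look like (the index names are made up, and the column order is an assumption based on the query in the question):

```sql
-- Lets the subquery find MAX(columnA) for one columnB value
-- by reading a single index entry instead of scanning the table.
CREATE INDEX tablea_columnb_columna_idx
    ON tableA (columnB, columnA DESC);

-- Supports the outer filter on columnA plus the columnC subset.
CREATE INDEX tablea_columna_columnc_idx
    ON tableA (columnA, columnC);

-- Check whether the planner actually uses the indexes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM tableA
WHERE columnA = (SELECT MAX(columnA)
                 FROM tableA
                 WHERE columnB = '1234')
  AND columnC IN (1, 2, 3);
```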

Answer 2

You can use a window function to improve performance. For example:

SELECT *
FROM tableA
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY columnB
    ORDER BY columnA DESC
) = 1

The query above selects the most recent row for each columnB value, ordering by columnA.

However, it looks like pk 1 and 2 are tied, so you may want to consider adding a tie-breaker to the ORDER BY clause.
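
PostgreSQL has no QUALIFY clause, so there the window function has to be computed in a subquery and filtered in the outer query. Below is a minimal sketch of that, assuming the table and column names from the question. Instead of adding a tie-breaker, it swaps ROW_NUMBER() for RANK(), so that all rows tied on the latest columnA are kept. The names ranked and rnk are just illustrative aliases.

```sql
SELECT pk, columnA, columnB, columnC
FROM (
    SELECT a.*,
           -- RANK() gives every row tied on the latest columnA the value 1,
           -- whereas ROW_NUMBER() would pick only one of them.
           RANK() OVER (
               PARTITION BY columnB
               ORDER BY columnA DESC
           ) AS rnk
    FROM tableA AS a
) AS ranked
WHERE rnk = 1;
```

With the sample rows from the question, and assuming columnA is a date column, this should keep pk 1 and 2 for columnB 1234 and pk 3 for columnB 0000.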

I'm not sure whether the Postgres syntax differs slightly, but another approach is:

SELECT 
  a.* 
FROM 
  tableA as a 
  INNER JOIN (
    SELECT 
      columnB, 
      MAX_A 
    FROM 
      (
        SELECT 
          columnB, 
          MAX(columnA) as MAX_A 
        FROM 
          tableA 
        GROUP BY 
          columnB
      ) as rsMax 
    GROUP BY 
      columnB, 
      MAX_A
  ) as rsUnique ON a.columnA = rsUnique.MAX_A 
  AND a.columnB = rsUnique.columnB

Because of the ties, I had to nest the inner subquery rsMax and deduplicate it.
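
For what it's worth, a single GROUP BY columnB already returns exactly one row per columnB value, so in PostgreSQL the join can probably be written without the second level of grouping. A sketch, assuming the column names from the question (rsMax and max_a are just aliases):

```sql
SELECT a.*
FROM tableA AS a
INNER JOIN (
    -- One row per columnB, carrying its latest columnA.
    SELECT columnB,
           MAX(columnA) AS max_a
    FROM tableA
    GROUP BY columnB
) AS rsMax
    ON a.columnB = rsMax.columnB
   AND a.columnA = rsMax.max_a;
```

Like the query above, this keeps every row that is tied on the latest columnA within its columnB group.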

I generated the SQL with Rasgo and tested it on Snowflake. At least one of these two approaches should work for you in Postgres.
