Query Value by Max Date in Postgresql

Question

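I have already asked this question, but with too little information about my problem. So I have created a new question with more details.

Here is my sample table. Each row contains the data a user filled in at one time, so the timestamp column is never null anywhere in the table. An item may have no recorded value if the user did not fill it in. id is an auto-generated column for each record.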

CREATE TABLE tbl (id int, customer_id text, item text, value text, timestamp timestamp);    
INSERT INTO tbl VALUES
(1, '001', 'price', '1000', '2021-11-01 01:00:00'),
(2, '001', 'price', '1500', '2021-11-02 01:00:00'),
(3, '001', 'price', '1400', '2021-11-03 01:00:00'),
(4, '001', 'condition', 'good', '2021-11-01 01:00:00'),
(5, '001', 'condition', 'good', '2021-11-02 01:00:00'),
(6, '001', 'condition', 'ok', '2021-11-03 01:00:00'),
(7, '001', 'feeling', 'sad', '2021-11-01 01:00:00'),
(8, '001', 'feeling', 'angry', '2021-11-02 01:00:00'),
(9, '001', 'feeling', 'fine', '2021-11-03 01:00:00'),
(10, '002', 'price', '1200', '2021-11-01 01:00:00'),
(11, '002', 'price', '1600', '2021-11-02 01:00:00'),
(12, '002', 'price', '2000', '2021-11-03 01:00:00'),
(13, '002', 'weather', 'sunny', '2021-11-01 01:00:00'),
(14, '002', 'weather', 'rain', '2021-11-02 01:00:00'),
(15, '002', 'price', '1900', '2021-11-04 01:00:00'),
(16, '002', 'feeling', 'sad', '2021-11-01 01:00:00'),
(17, '002', 'feeling', 'angry', '2021-11-02 01:00:00'),
(18, '002', 'feeling', 'fine', '2021-11-03 01:00:00'),
(19, '003', 'price', '1000', '2021-11-01 01:00:00'),
(20, '003', 'price', '1500', '2021-11-02 01:00:00'),
(21, '003', 'price', '2000', '2021-11-03 01:00:00'),
(22, '003', 'condition', 'ok', '2021-11-01 01:00:00'),
(23, '003', 'weather', 'rain', '2021-11-02 01:00:00'),
(24, '003', 'condition', 'bad', '2021-11-03 01:00:00'),
(25, '003', 'feeling', 'fine', '2021-11-01 01:00:00'),
(26, '003', 'weather', 'sunny', '2021-11-03 01:00:00'),
(27, '003', 'feeling', 'sad', '2021-11-03 01:00:00')
;

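To make it easier to read, I sorted the table above by id and timestamp. The order does not matter.

We are using PostgreSQL 9.5.19.
The actual table contains over 4 million rows.
The item column contains more than 500 distinct items, but don't worry: I will query at most 10 items. In the table above I used only 4.
We also have another table named Customer_table, which has a unique Customer_id and holds general customer information.

From the table above, I want to query the data to build a table containing the most recently updated value of each item, as shown below. I will query at most 10 items, so there may be up to 10 columns.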

customer_id  price  condition  feeling   weather .......(there may be other columns from item column)
   002        1900    null      fine      rain
   001        1400     ok       fine      null
   003        2000    bad       sad       sunny

This is the query I got from my earlier question, where I asked about only two items.

SELECT customer_id, p.value AS price, c.value AS condition
FROM  (
   SELECT DISTINCT ON (customer_id)
          customer_id, value
   FROM   tbl
   WHERE  item = 'condition'
   ORDER  BY customer_id, timestamp DESC
   ) c
FULL JOIN (
   SELECT DISTINCT ON (customer_id)
          customer_id, value
   FROM   tbl
   WHERE  item = 'price'
   ORDER  BY customer_id, timestamp DESC
   ) p USING (customer_id)

So, if there is a better solution, please help me. Thank you.

Answer 1

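You may try the following approaches, which use row_number to generate a value for filtering your data down to the most recent rows. You can then aggregate by customer id, taking the MAX of a CASE expression that picks the value for each item name from the rows with rn = 1 (the latest row, since we order by timestamp descending).

These approaches are less verbose, and judging by the online results they seem to perform at least as well (planning plus execution time), although timings on a 27-row sample say little about a 4-million-row table. Let me know in the comments how they behave in your environment.

You can use EXPLAIN ANALYZE to compare these approaches with your current one. Results from the online environment: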

Current approach

| Planning time: 0.129 ms                                                                                                      
| Execution time: 0.056 ms

Suggested approach 1

| Planning time: 0.061 ms                                                                                                 
| Execution time: 0.070 ms

Suggested approach 2

| Planning time: 0.047 ms                                                                                                 
| Execution time: 0.056 ms

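NB: You can use EXPLAIN ANALYZE to compare these approaches in your own environment, which we cannot replicate online. Results may also differ from run to run. An index on the item column and an early filter on item are also recommended to improve performance.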
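The index and early-filter advice could look roughly like the sketch below. The index name tbl_item_idx and the item list are assumptions made for illustration, and whether the planner actually uses the index depends on how selective the item filter is on the real table.

```sql
-- Hypothetical index name; a plain index on item to support the early filter.
CREATE INDEX tbl_item_idx ON tbl (item);

-- Early filter: restrict to the wanted items before numbering the rows,
-- so the window function only sorts the rows that can appear in the result.
SELECT customer_id, item, value
FROM (
    SELECT customer_id, item, value,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id, item
               ORDER BY "timestamp" DESC
           ) AS rn
    FROM   tbl
    WHERE  item IN ('price', 'condition', 'feeling', 'weather')  -- example item list
) t1
WHERE rn = 1;
```

This returns one row per (customer_id, item); the MAX(...) aggregation in the suggested approaches then pivots those rows into columns.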

Schema (PostgreSQL v9.5)

Suggested approach 1

SELECT
    t1.customer_id,
    MAX(CASE WHEN t1.item='condition' THEN t1.value END) as conditio,
    MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
    MAX(CASE WHEN t1.item='feeling' THEN t1.value END) as feeling,
    MAX(CASE WHEN t1.item='weather' THEN t1.value END) as weather
FROM (
    SELECT
        * ,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id,item
            ORDER BY tbl.timestamp DESC
        ) as rn
    FROM
        tbl 
    -- ensure that you filter based on your desired items
    -- indexes on item column are recommended to improve performance
) t1
WHERE rn=1
GROUP BY
   1;

customer_id	conditio	price	feeling	weather
001	ok	1400	fine
002		1900	fine	rain
003	bad	2000	sad	sunny

Suggested approach 2

SELECT
    t1.customer_id,
    MAX(t1.value) FILTER (WHERE  t1.item='condition')  as conditio,
    MAX(t1.value) FILTER (WHERE  t1.item='price')  as price,
    MAX(t1.value) FILTER (WHERE  t1.item='feeling')  as feeling,
    MAX(t1.value) FILTER (WHERE  t1.item='weather')  as weather
    
FROM (
    SELECT
        * ,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id,item
            ORDER BY tbl.timestamp DESC
        ) as rn
    FROM
        tbl 
    -- ensure that you filter based on your desired items
    -- indexes on item column are recommended to improve performance
) t1
WHERE rn=1
GROUP BY
   1;

customer_id	conditio	price	feeling	weather
001	ok	1400	fine
002		1900	fine	rain
003	bad	2000	sad	sunny

Current approach with EXPLAIN ANALYZE

EXPLAIN(ANALYZE,BUFFERS)
SELECT customer_id, p.value AS price, c.value AS condition
FROM  (
   SELECT DISTINCT ON (customer_id)
          customer_id, value
   FROM   tbl
   WHERE  item = 'condition'
   ORDER  BY customer_id, timestamp DESC
   ) c
FULL JOIN (
   SELECT DISTINCT ON (customer_id)
          customer_id, value
   FROM   tbl
   WHERE  item = 'price'
   ORDER  BY customer_id, timestamp DESC
   ) p USING (customer_id);

QUERY PLAN
Merge Full Join (cost=35.05..35.12 rows=1 width=128) (actual time=0.025..0.030 rows=3 loops=1)
Merge Cond: (tbl.customer_id = tbl_1.customer_id)
Buffers: shared hit=2
-> Unique (cost=17.52..17.54 rows=1 width=72) (actual time=0.013..0.014 rows=2 loops=1)
Buffers: shared hit=1
-> Sort (cost=17.52..17.53 rows=3 width=72) (actual time=0.013..0.013 rows=5 loops=1)
Sort Key: tbl.customer_id, tbl."timestamp" DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=1
-> Seq Scan on tbl (cost=0.00..17.50 rows=3 width=72) (actual time=0.004..0.006 rows=5 loops=1)
Filter: (item = 'condition'::text)
Rows Removed by Filter: 22
Buffers: shared hit=1
-> Materialize (cost=17.52..17.55 rows=1 width=64) (actual time=0.010..0.013 rows=3 loops=1)
Buffers: shared hit=1
-> Unique (cost=17.52..17.54 rows=1 width=72) (actual time=0.010..0.012 rows=3 loops=1)
Buffers: shared hit=1
-> Sort (cost=17.52..17.53 rows=3 width=72) (actual time=0.010..0.010 rows=10 loops=1)
Sort Key: tbl_1.customer_id, tbl_1."timestamp" DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=1
-> Seq Scan on tbl tbl_1 (cost=0.00..17.50 rows=3 width=72) (actual time=0.001..0.003 rows=10 loops=1)
Filter: (item = 'price'::text)
Rows Removed by Filter: 17
Buffers: shared hit=1
Planning time: 0.129 ms
Execution time: 0.056 ms

Suggested approach 1 with EXPLAIN ANALYZE

EXPLAIN(ANALYZE,BUFFERS)
SELECT
    t1.customer_id,
    MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
    MAX(CASE WHEN t1.item='condition' THEN t1.value END) as conditio
    
FROM (
    SELECT
        * ,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id,item
            ORDER BY tbl.timestamp DESC
        ) as rn
    FROM
        tbl 
    where item IN ('price','condition')
) t1
WHERE rn=1
GROUP BY
   1;

QUERY PLAN
GroupAggregate (cost=17.58..17.81 rows=1 width=96) (actual time=0.039..0.047 rows=3 loops=1)
Group Key: t1.customer_id
Buffers: shared hit=1
-> Subquery Scan on t1 (cost=17.58..17.79 rows=1 width=96) (actual time=0.030..0.040 rows=5 loops=1)
Filter: (t1.rn = 1)
Rows Removed by Filter: 10
Buffers: shared hit=1
-> WindowAgg (cost=17.58..17.71 rows=6 width=104) (actual time=0.029..0.038 rows=15 loops=1)
Buffers: shared hit=1
-> Sort (cost=17.58..17.59 rows=6 width=104) (actual time=0.028..0.030 rows=15 loops=1)
Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC
Sort Method: quicksort Memory: 26kB
Buffers: shared hit=1
-> Seq Scan on tbl (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1)
Filter: (item = ANY ('{price,condition}'::text[]))
Rows Removed by Filter: 12
Buffers: shared hit=1
Planning time: 0.061 ms
Execution time: 0.070 ms

Suggested approach 2 with EXPLAIN ANALYZE

EXPLAIN(ANALYZE,BUFFERS)
SELECT
    t1.customer_id,
    MAX(t1.value) FILTER (WHERE  t1.item='price')  as price,
    MAX(t1.value) FILTER (WHERE  t1.item='condition')  as conditio
    
FROM (
    SELECT
        * ,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id,item
            ORDER BY tbl.timestamp DESC
        ) as rn
    FROM
        tbl 
    where item IN ('price','condition')
) t1
WHERE rn=1
GROUP BY
   1;

QUERY PLAN
GroupAggregate (cost=17.58..17.81 rows=1 width=96) (actual time=0.029..0.037 rows=3 loops=1)
Group Key: t1.customer_id
Buffers: shared hit=1
-> Subquery Scan on t1 (cost=17.58..17.79 rows=1 width=96) (actual time=0.021..0.032 rows=5 loops=1)
Filter: (t1.rn = 1)
Rows Removed by Filter: 10
Buffers: shared hit=1
-> WindowAgg (cost=17.58..17.71 rows=6 width=104) (actual time=0.021..0.030 rows=15 loops=1)
Buffers: shared hit=1
-> Sort (cost=17.58..17.59 rows=6 width=104) (actual time=0.019..0.021 rows=15 loops=1)
Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC
Sort Method: quicksort Memory: 26kB
Buffers: shared hit=1
-> Seq Scan on tbl (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1)
Filter: (item = ANY ('{price,condition}'::text[]))
Rows Removed by Filter: 12
Buffers: shared hit=1
Planning time: 0.047 ms
Execution time: 0.056 ms

View working demo on DB Fiddle

Answer 2

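You operate on a big table. You mentioned 4 million rows, evidently still growing. While querying for ...

all customers
all items
few rows per (customer_id, item)
narrow rows (small row size)

... the queries with row_number() are great. And short, too.
The whole table has to be processed in a sequential scan. Indexes will not be used.
But prefer "approach 2" with the modern aggregate FILTER syntax. It is clearer and faster. See performance tests here: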

For absolute performance, is SUM faster or COUNT?

Approach 3: Pivot with `crosstab()`

crosstab() is typically faster, especially for more than a few items. See:

PostgreSQL Crosstab Query

SELECT *
FROM   crosstab(
   $$
   SELECT customer_id, item, value
   FROM  (
      SELECT customer_id, item, value
           , row_number() OVER (PARTITION BY customer_id, item ORDER BY t.timestamp DESC) AS rn
      FROM   tbl t
      WHERE  item = ANY ('{condition,price,feeling,weather}')  -- your items here ...
      ) t1
   WHERE  rn = 1
   ORDER  BY customer_id, item
   $$
 , $$SELECT unnest('{condition,price,feeling,weather}'::text[])$$  -- ... here ...
   ) AS ct (customer_id text, condition text, price text, feeling text, weather text);  -- ... and here ...
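Note that crosstab() is not part of core PostgreSQL; it is provided by the additional module tablefunc, which has to be installed once per database before the query above can run. A minimal sketch, assuming you have the privileges to create extensions in that database:

```sql
-- Install the tablefunc module (provides crosstab) once per database.
-- IF NOT EXISTS makes the statement safe to re-run.
CREATE EXTENSION IF NOT EXISTS tablefunc;
```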

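Approach 4: `LATERAL` subqueries

If one or more of the conditions listed at the top do not apply, the performance of the queries above deteriorates quickly.

For starters, at most 10 of the "500 distinct items" are involved. That is at most ~ 2 % of the big table. That alone should make one of the following queries (much) faster in comparison: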

SELECT *
FROM  (SELECT customer_id FROM customer) c
LEFT   JOIN LATERAL (
   SELECT value AS condition
   FROM   tbl t
   WHERE  t.customer_id = c.customer_id
   AND    t.item = 'condition'
   ORDER  BY t.timestamp DESC
   LIMIT  1
   ) AS t1 ON true
LEFT   JOIN LATERAL (
   SELECT value AS price
   FROM   tbl t
   WHERE  t.customer_id = c.customer_id
   AND    t.item = 'price'
   ORDER  BY t.timestamp DESC
   LIMIT  1
   ) AS t2 ON true
LEFT   JOIN LATERAL (
   SELECT value AS feeling
   FROM   tbl t
   WHERE  t.customer_id = c.customer_id
   AND    t.item = 'feeling'
   ORDER  BY t.timestamp DESC
   LIMIT  1
   ) AS t3 ON true
--  ... more?
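The "~ 2 %" estimate presumes that rows are spread roughly evenly across items. A quick way to check the actual share for your item list is sketched below; the item list is only an example, and the check itself scans the whole table once.

```sql
-- Share of rows (in percent) that belong to the items of interest.
-- NULLIF guards against division by zero on an empty table.
SELECT round(100.0 * count(*) FILTER (WHERE item IN ('condition', 'price', 'feeling', 'weather'))
             / NULLIF(count(*), 0), 2) AS pct_of_rows
FROM   tbl;
```

The smaller this share, the more the index-based approaches should gain over the sequential-scan approaches.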

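About LEFT JOIN LATERAL:

The point is to get a query plan with relatively few index(-only) scans to replace the expensive sequential scan on the big table.
This requires an applicable index, obviously: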

CREATE INDEX ON tbl (customer_id, item);

Or better (in Postgres 9.5):

CREATE INDEX ON tbl (customer_id, item, timestamp DESC, value);

In Postgres 11 or later, this would be better yet:

CREATE INDEX ON tbl (customer_id, item, timestamp DESC) INCLUDE (value);

See here or here or here.

If only a few items are of interest, partial indexes on those items would be better yet.
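Such a partial index might look like the sketch below. The index name and the item list are assumptions for illustration; the column list mirrors the Postgres 9.5 index shown earlier.

```sql
-- Hypothetical index name; covers only the rows for the items of interest,
-- so it stays far smaller than an index over all 500+ items.
CREATE INDEX tbl_hot_items_idx
ON tbl (customer_id, item, "timestamp" DESC, value)
WHERE item IN ('condition', 'price', 'feeling', 'weather');
```

The planner considers a partial index only when it can prove that the query's WHERE clause implies the index predicate. A condition such as t.item = 'price' should qualify, but it is worth confirming with EXPLAIN that the index is actually chosen.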

Approach 5: Correlated subqueries

SELECT c.customer_id
     , (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'condition' ORDER BY t.timestamp DESC LIMIT 1) AS condition
     , (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'price'     ORDER BY t.timestamp DESC LIMIT 1) AS price
     , (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'feeling'   ORDER BY t.timestamp DESC LIMIT 1) AS feeling
     , (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'weather'   ORDER BY t.timestamp DESC LIMIT 1) AS weather
FROM   customer c;

Not as versatile as LATERAL, but good enough for this purpose. Same index requirements as approach 4.

Approach 5 will be the king of performance in most cases.
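To see whether approach 5 really gets index(-only) scans in your environment, one option is to run a cut-down version under EXPLAIN, as Answer 1 did for the other approaches. A sketch, assuming the customer table and one of the indexes on (customer_id, item, timestamp DESC, ...) from approach 4 exist; it is reduced to two items for brevity:

```sql
-- Runs the query and reports the actual plan, timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.customer_id
     , (SELECT t.value
        FROM   tbl t
        WHERE  t.customer_id = c.customer_id
        AND    t.item = 'condition'
        ORDER  BY t."timestamp" DESC
        LIMIT  1) AS condition
     , (SELECT t.value
        FROM   tbl t
        WHERE  t.customer_id = c.customer_id
        AND    t.item = 'price'
        ORDER  BY t."timestamp" DESC
        LIMIT  1) AS price
FROM   customer c;
```

In the output, look for Index Scan or Index Only Scan nodes on tbl inside the subplans rather than Seq Scan. Index-only scans also depend on the visibility map being reasonably current, so results right after a bulk load may improve once the table has been vacuumed.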

db<>fiddle here

Improving your relational design and/or upgrading to a current version of Postgres would go a long way, too.
