Query Value by Max Date in PostgreSQL
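I have already asked this question, but it contained too little information about my problem. So I created a new question with more information.
Here is my sample table. Each row contains the data a user filled in at one point in time, so the timestamp column is never null anywhere in the table. If the user did not fill in an item, there may be no recorded value for that item. id is an auto-generated column for each record.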
CREATE TABLE tbl (id int, customer_id text, item text, value text, timestamp timestamp);
INSERT INTO tbl VALUES
(1, '001', 'price', '1000', '2021-11-01 01:00:00'),
(2, '001', 'price', '1500', '2021-11-02 01:00:00'),
(3, '001', 'price', '1400', '2021-11-03 01:00:00'),
(4, '001', 'condition', 'good', '2021-11-01 01:00:00'),
(5, '001', 'condition', 'good', '2021-11-02 01:00:00'),
(6, '001', 'condition', 'ok', '2021-11-03 01:00:00'),
(7, '001', 'feeling', 'sad', '2021-11-01 01:00:00'),
(8, '001', 'feeling', 'angry', '2021-11-02 01:00:00'),
(9, '001', 'feeling', 'fine', '2021-11-03 01:00:00'),
(10, '002', 'price', '1200', '2021-11-01 01:00:00'),
(11, '002', 'price', '1600', '2021-11-02 01:00:00'),
(12, '002', 'price', '2000', '2021-11-03 01:00:00'),
(13, '002', 'weather', 'sunny', '2021-11-01 01:00:00'),
(14, '002', 'weather', 'rain', '2021-11-02 01:00:00'),
(15, '002', 'price', '1900', '2021-11-04 01:00:00'),
(16, '002', 'feeling', 'sad', '2021-11-01 01:00:00'),
(17, '002', 'feeling', 'angry', '2021-11-02 01:00:00'),
(18, '002', 'feeling', 'fine', '2021-11-03 01:00:00'),
(19, '003', 'price', '1000', '2021-11-01 01:00:00'),
(20, '003', 'price', '1500', '2021-11-02 01:00:00'),
(21, '003', 'price', '2000', '2021-11-03 01:00:00'),
(22, '003', 'condition', 'ok', '2021-11-01 01:00:00'),
(23, '003', 'weather', 'rain', '2021-11-02 01:00:00'),
(24, '003', 'condition', 'bad', '2021-11-03 01:00:00'),
(25, '003', 'feeling', 'fine', '2021-11-01 01:00:00'),
(26, '003', 'weather', 'sunny', '2021-11-03 01:00:00'),
(27, '003', 'feeling', 'sad', '2021-11-03 01:00:00')
;
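To make it easier to read, I sorted the table above by id and timestamp. The order does not matter.
- We are using PostgreSQL version 9.5.19
- The actual table contains more than 4 million rows
- The item column contains more than 500 distinct items, but please don't worry about that. I will query at most 10 items. In the table above I used only 4
- We also have another table called Customer_table, which has a unique Customer_id and contains general customer information.
From the table above, I want to query the data to build a table that holds the most recently updated value per item, as shown below. I will query at most 10 items, so there may be up to 10 columns.
customer_id  price  condition  feeling  weather  ... (there may be other columns from the item column)
002          1900   null       fine     rain
001          1400   ok         fine     null
003          2000   bad        sad      sunny
This is the query I got from my earlier question, but there I asked about only two items.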
SELECT customer_id, p.value AS price, c.value AS condition
FROM (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'condition'
ORDER BY customer_id, timestamp DESC
) c
FULL JOIN (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'price'
ORDER BY customer_id, timestamp DESC
) p USING (customer_id)
So, if there is a better solution, please help me.
Thank you.
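You can try other approaches that use row_number to generate a value for filtering your data down to the most recent rows. You can then aggregate by customer ID, taking the max of a CASE expression that picks out your records by the desired row number rn=1 (we order descending) and by item name.
These approaches are less verbose. In the online results below they plan faster, while execution time on this tiny sample is on par with the current approach. Let me know in the comments how this replicates in your environment.
You can use EXPLAIN ANALYZE to compare these approaches with the current one. Results from the online environment:
Current approach
| Planning time: 0.129 ms
| Execution time: 0.056 ms
Suggested approach 1
| Planning time: 0.061 ms
| Execution time: 0.070 ms
Suggested approach 2
| Planning time: 0.047 ms
| Execution time: 0.056 ms
Note: You can use EXPLAIN ANALYZE to compare these approaches in your own environment, which we cannot replicate online. Results may also differ from run to run. Indexes and early filters on the item column are also recommended to improve performance.
Schema (PostgreSQL v9.5)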
Suggested approach 1
SELECT
    t1.customer_id,
    MAX(CASE WHEN t1.item='condition' THEN t1.value END) as condition,
    MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
    MAX(CASE WHEN t1.item='feeling' THEN t1.value END) as feeling,
    MAX(CASE WHEN t1.item='weather' THEN t1.value END) as weather
FROM (
    SELECT
        * ,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id,item
            ORDER BY tbl.timestamp DESC
        ) as rn
    FROM
        tbl
    -- ensure that you filter based on your desired items
    -- indexes on item column are recommended to improve performance
) t1
WHERE rn=1
GROUP BY
    1;
| customer_id | condition | price | feeling | weather |
|---|---|---|---|---|
| 001 | ok | 1400 | fine | |
| 002 | | 1900 | fine | rain |
| 003 | bad | 2000 | sad | sunny |
Suggested approach 2
SELECT
    t1.customer_id,
    MAX(t1.value) FILTER (WHERE t1.item='condition') as condition,
    MAX(t1.value) FILTER (WHERE t1.item='price') as price,
    MAX(t1.value) FILTER (WHERE t1.item='feeling') as feeling,
    MAX(t1.value) FILTER (WHERE t1.item='weather') as weather
FROM (
    SELECT
        * ,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id,item
            ORDER BY tbl.timestamp DESC
        ) as rn
    FROM
        tbl
    -- ensure that you filter based on your desired items
    -- indexes on item column are recommended to improve performance
) t1
WHERE rn=1
GROUP BY
    1;
| customer_id | condition | price | feeling | weather |
|---|---|---|---|---|
| 001 | ok | 1400 | fine | |
| 002 | | 1900 | fine | rain |
| 003 | bad | 2000 | sad | sunny |
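The comments inside the two queries ask for an early filter on the desired items, but the queries as posted still read every item. A minimal sketch of approach 2 with that filter applied, assuming the four sample items are the ones of interest (the item list is an assumption; replace it with your own, up to 10):

```sql
-- Approach 2 with an early filter: only the listed items reach the window function.
SELECT
    t1.customer_id,
    MAX(t1.value) FILTER (WHERE t1.item = 'condition') AS condition,
    MAX(t1.value) FILTER (WHERE t1.item = 'price')     AS price,
    MAX(t1.value) FILTER (WHERE t1.item = 'feeling')   AS feeling,
    MAX(t1.value) FILTER (WHERE t1.item = 'weather')   AS weather
FROM (
    SELECT
        customer_id, item, value,
        -- rn = 1 marks the most recent row per (customer_id, item)
        ROW_NUMBER() OVER (
            PARTITION BY customer_id, item
            ORDER BY tbl.timestamp DESC
        ) AS rn
    FROM tbl
    WHERE item IN ('condition', 'price', 'feeling', 'weather')  -- assumed item list
) t1
WHERE t1.rn = 1
GROUP BY t1.customer_id
ORDER BY t1.customer_id;
```

Selecting only the three needed columns instead of * also keeps the rows that have to be sorted narrow, which may matter on a table with millions of rows.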
Current approach with EXPLAIN ANALYZE
EXPLAIN(ANALYZE,BUFFERS)
SELECT customer_id, p.value AS price, c.value AS condition
FROM (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'condition'
ORDER BY customer_id, timestamp DESC
) c
FULL JOIN (
SELECT DISTINCT ON (customer_id)
customer_id, value
FROM tbl
WHERE item = 'price'
ORDER BY customer_id, timestamp DESC
) p USING (customer_id);
QUERY PLAN
Merge Full Join (cost=35.05..35.12 rows=1 width=128) (actual time=0.025..0.030 rows=3 loops=1)
  Merge Cond: (tbl.customer_id = tbl_1.customer_id)
  Buffers: shared hit=2
  -> Unique (cost=17.52..17.54 rows=1 width=72) (actual time=0.013..0.014 rows=2 loops=1)
       Buffers: shared hit=1
       -> Sort (cost=17.52..17.53 rows=3 width=72) (actual time=0.013..0.013 rows=5 loops=1)
            Sort Key: tbl.customer_id, tbl."timestamp" DESC
            Sort Method: quicksort Memory: 25kB
            Buffers: shared hit=1
            -> Seq Scan on tbl (cost=0.00..17.50 rows=3 width=72) (actual time=0.004..0.006 rows=5 loops=1)
                 Filter: (item = 'condition'::text)
                 Rows Removed by Filter: 22
                 Buffers: shared hit=1
  -> Materialize (cost=17.52..17.55 rows=1 width=64) (actual time=0.010..0.013 rows=3 loops=1)
       Buffers: shared hit=1
       -> Unique (cost=17.52..17.54 rows=1 width=72) (actual time=0.010..0.012 rows=3 loops=1)
            Buffers: shared hit=1
            -> Sort (cost=17.52..17.53 rows=3 width=72) (actual time=0.010..0.010 rows=10 loops=1)
                 Sort Key: tbl_1.customer_id, tbl_1."timestamp" DESC
                 Sort Method: quicksort Memory: 25kB
                 Buffers: shared hit=1
                 -> Seq Scan on tbl tbl_1 (cost=0.00..17.50 rows=3 width=72) (actual time=0.001..0.003 rows=10 loops=1)
                      Filter: (item = 'price'::text)
                      Rows Removed by Filter: 17
                      Buffers: shared hit=1
Planning time: 0.129 ms
Execution time: 0.056 ms
Suggested approach 1 with EXPLAIN ANALYZE
EXPLAIN(ANALYZE,BUFFERS)
SELECT
t1.customer_id,
MAX(CASE WHEN t1.item='price' THEN t1.value END) as price,
MAX(CASE WHEN t1.item='condition' THEN t1.value END) as condition
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
where item IN ('price','condition')
) t1
WHERE rn=1
GROUP BY
1;
QUERY PLAN
GroupAggregate (cost=17.58..17.81 rows=1 width=96) (actual time=0.039..0.047 rows=3 loops=1)
  Group Key: t1.customer_id
  Buffers: shared hit=1
  -> Subquery Scan on t1 (cost=17.58..17.79 rows=1 width=96) (actual time=0.030..0.040 rows=5 loops=1)
       Filter: (t1.rn = 1)
       Rows Removed by Filter: 10
       Buffers: shared hit=1
       -> WindowAgg (cost=17.58..17.71 rows=6 width=104) (actual time=0.029..0.038 rows=15 loops=1)
            Buffers: shared hit=1
            -> Sort (cost=17.58..17.59 rows=6 width=104) (actual time=0.028..0.030 rows=15 loops=1)
                 Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC
                 Sort Method: quicksort Memory: 26kB
                 Buffers: shared hit=1
                 -> Seq Scan on tbl (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1)
                      Filter: (item = ANY ('{price,condition}'::text[]))
                      Rows Removed by Filter: 12
                      Buffers: shared hit=1
Planning time: 0.061 ms
Execution time: 0.070 ms
Suggested approach 2 with EXPLAIN ANALYZE
EXPLAIN(ANALYZE,BUFFERS)
SELECT
t1.customer_id,
MAX(t1.value) FILTER (WHERE t1.item='price') as price,
MAX(t1.value) FILTER (WHERE t1.item='condition') as condition
FROM (
SELECT
* ,
ROW_NUMBER() OVER (
PARTITION BY customer_id,item
ORDER BY tbl.timestamp DESC
) as rn
FROM
tbl
where item IN ('price','condition')
) t1
WHERE rn=1
GROUP BY
1;
QUERY PLAN
GroupAggregate (cost=17.58..17.81 rows=1 width=96) (actual time=0.029..0.037 rows=3 loops=1)
  Group Key: t1.customer_id
  Buffers: shared hit=1
  -> Subquery Scan on t1 (cost=17.58..17.79 rows=1 width=96) (actual time=0.021..0.032 rows=5 loops=1)
       Filter: (t1.rn = 1)
       Rows Removed by Filter: 10
       Buffers: shared hit=1
       -> WindowAgg (cost=17.58..17.71 rows=6 width=104) (actual time=0.021..0.030 rows=15 loops=1)
            Buffers: shared hit=1
            -> Sort (cost=17.58..17.59 rows=6 width=104) (actual time=0.019..0.021 rows=15 loops=1)
                 Sort Key: tbl.customer_id, tbl.item, tbl."timestamp" DESC
                 Sort Method: quicksort Memory: 26kB
                 Buffers: shared hit=1
                 -> Seq Scan on tbl (cost=0.00..17.50 rows=6 width=104) (actual time=0.003..0.008 rows=15 loops=1)
                      Filter: (item = ANY ('{price,condition}'::text[]))
                      Rows Removed by Filter: 12
                      Buffers: shared hit=1
Planning time: 0.047 ms
Execution time: 0.056 ms
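You operate on a big table. You mentioned 4 million rows, apparently still growing. As long as you query for ...
- all customers
- all items
- with few rows per (customer_id, item)
- with narrow rows (small row size)
... the queries with row_number() are great. Short, too.
The whole table has to be processed in a sequential scan. Indexes will not be used.
But prefer "approach 2" with the modern aggregate FILTER syntax. It is clearer and faster.
Approach 3: Pivot with crosstab()
crosstab() is typically faster, especially for more than a few items. See:
- PostgreSQL Crosstab Query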
SELECT *
FROM crosstab(
$$
SELECT customer_id, item, value
FROM (
SELECT customer_id, item, value
, row_number() OVER (PARTITION BY customer_id, item ORDER BY t.timestamp DESC) AS rn
FROM tbl t
WHERE item = ANY ('{condition,price,feeling,weather}') -- your items here ...
) t1
WHERE rn = 1
ORDER BY customer_id, item
$$
, $$SELECT unnest('{condition,price,feeling,weather}'::text[])$$ -- ... here ...
) AS ct (customer_id text, condition text, price text, feeling text, weather text); -- ... and here ...
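Note that crosstab() is not built into the core server. It is provided by the additional module tablefunc, which has to be installed once per database before the query above can run. A minimal sketch, assuming your role is allowed to create extensions:

```sql
-- crosstab() lives in the tablefunc extension; install it once per database.
CREATE EXTENSION IF NOT EXISTS tablefunc;

-- Optional sanity check (illustrative): the function should now be listed.
SELECT proname
FROM pg_proc
WHERE proname = 'crosstab'
LIMIT 1;
```

Also keep in mind that the item list appears three times in the crosstab query (the filter, the category query and the column definition list), and all three have to stay in the same order.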
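Approach 4: LATERAL subqueries
If one or more of the conditions listed at the top do not hold, the performance of the queries above deteriorates quickly.
For starters, at most 10 of the "500 distinct items" are involved. That is at most ~ 2 % of the big table. That alone should make one of the following queries (much) faster in comparison: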
SELECT *
FROM (SELECT customer_id FROM customer) c
LEFT JOIN LATERAL (
SELECT value AS condition
FROM tbl t
WHERE t.customer_id = c.customer_id
AND t.item = 'condition'
ORDER BY t.timestamp DESC
LIMIT 1
) AS t1 ON true
LEFT JOIN LATERAL (
SELECT value AS price
FROM tbl t
WHERE t.customer_id = c.customer_id
AND t.item = 'price'
ORDER BY t.timestamp DESC
LIMIT 1
) AS t2 ON true
LEFT JOIN LATERAL (
SELECT value AS feeling
FROM tbl t
WHERE t.customer_id = c.customer_id
AND t.item = 'feeling'
ORDER BY t.timestamp DESC
LIMIT 1
) AS t3 ON true
-- ... more?
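The query above reads the customer IDs from a table called customer, while the question names it Customer_table with a Customer_id column. To try the query against the sample data, a stand-in table could be set up as follows. This is only a sketch: the table definition is an assumption derived from the sample data, not the asker's real table.

```sql
-- Hypothetical stand-in for the asker's Customer_table.
CREATE TABLE customer (customer_id text PRIMARY KEY);

-- One row per distinct customer found in the sample data.
INSERT INTO customer (customer_id)
SELECT DISTINCT customer_id
FROM tbl;
```

Driving the query from the customer table means a customer without any matching row in tbl still shows up in the result, with nulls in every item column.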
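The point of LEFT JOIN LATERAL here is to get a query plan with relatively few index(-only) scans that replace the expensive sequential scan on the big table.
This requires an applicable index, obviously: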
CREATE INDEX ON tbl (customer_id, item);
Or better (in Postgres 9.5):
CREATE INDEX ON tbl (customer_id, item, timestamp DESC, value);
In Postgres 11 or later, this would be better still:
CREATE INDEX ON tbl (customer_id, item, timestamp DESC) INCLUDE (value);
If only a few items are of interest, partial indexes on those items would be even better.
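A partial index of that kind could look like the sketch below. The index name and the item list are assumptions for illustration; the column order follows the Postgres 9.5 index above.

```sql
-- Sketch: index only the rows for the items that are actually queried.
-- Index name and item list are assumptions; adjust them to your own items.
CREATE INDEX tbl_latest_items_idx
ON tbl (customer_id, item, timestamp DESC, value)
WHERE item IN ('condition', 'price', 'feeling', 'weather');
```

The planner only considers a partial index when the WHERE clause of the query implies the index predicate, so it is worth confirming with EXPLAIN that the subqueries filtering on a single item actually pick it up.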
Approach 5: Correlated subqueries
SELECT c.customer_id
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'condition' ORDER BY t.timestamp DESC LIMIT 1) AS condition
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'price' ORDER BY t.timestamp DESC LIMIT 1) AS price
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'feeling' ORDER BY t.timestamp DESC LIMIT 1) AS feeling
, (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'weather' ORDER BY t.timestamp DESC LIMIT 1) AS weather
FROM customer c;
Not as versatile as LATERAL, but good enough for this purpose. Same index requirements as approach 4.
Approach 5 will be the king of performance in most cases.
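Whether approach 5 really comes out ahead on your data depends on the planner using the index from approach 4. One way to check is sketched below. It assumes that the customer table and one of the multicolumn indexes above exist, and it shows only two items for brevity.

```sql
-- Refresh statistics and the visibility map so index-only scans become possible.
VACUUM ANALYZE tbl;

-- In the output, look for "Index Only Scan" (or "Index Scan") nodes on tbl
-- instead of "Seq Scan".
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.customer_id
     , (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'condition' ORDER BY t.timestamp DESC LIMIT 1) AS condition
     , (SELECT value FROM tbl t WHERE t.customer_id = c.customer_id AND t.item = 'price' ORDER BY t.timestamp DESC LIMIT 1) AS price
FROM   customer c;
```

Timings on the 27-row sample say little either way, so the comparison is only meaningful against the real table with its 4 million rows.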
Improving your relational design and/or upgrading to a current version of Postgres would also go a long way.