Performance of max() vs ORDER BY DESC + LIMIT 1
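I was troubleshooting a few slow SQL queries today and don't quite understand the performance difference below:
When trying to extract the max(timestamp) from a data table based on some condition, using MAX() is slower than ORDER BY timestamp LIMIT 1 if a matching row exists, but considerably faster if no matching row is found.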
SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 4
ORDER BY timestamp DESC
LIMIT 1;
(0 rows)
Time: 1314.544 ms
SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 5
ORDER BY timestamp DESC
LIMIT 1;
(1 row)
Time: 10.890 ms
SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 4;
(0 rows)
Time: 0.869 ms
SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 5;
(1 row)
Time: 84.087 ms
There are indexes on (timestamp) and (sensor_id, timestamp), and I noticed that Postgres uses very different query plans and indexes for the two cases:
QUERY PLAN (ORDER BY)
--------------------------------------------------------------------------------------------------------
Limit (cost=0.43..9.47 rows=1 width=8)
  -> Nested Loop (cost=0.43..396254.63 rows=43823 width=8)
        Join Filter: (data.sensor_id = sensors.id)
        -> Index Scan using timestamp_ind on data (cost=0.43..254918.66 rows=4710976 width=12)
        -> Materialize (cost=0.00..6.70 rows=2 width=4)
              -> Seq Scan on sensors (cost=0.00..6.69 rows=2 width=4)
                    Filter: (station_id = 4)
(7 rows)
QUERY PLAN (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate (cost=3680.59..3680.60 rows=1 width=8)
  -> Nested Loop (cost=0.43..3571.03 rows=43823 width=8)
        -> Seq Scan on sensors (cost=0.00..6.69 rows=2 width=4)
              Filter: (station_id = 4)
        -> Index Only Scan using sensor_ind_timestamp on data (cost=0.43..1389.59 rows=39258 width=12)
              Index Cond: (sensor_id = sensors.id)
(6 rows)
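So my two questions are:
- Where does this performance difference come from? I've seen the accepted answer here, MIN/MAX vs ORDER BY and LIMIT, but that doesn't quite seem to apply here. Any good resources would be appreciated.
- Is there a better way to improve performance in all cases (matching rows vs. no matching rows) than adding an EXISTS check?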
EDIT to address the questions in the comments below. I kept the initial query plans above for future reference:
Table definitions:
Table "public.sensors"
Column | Type | Modifiers
----------------------+------------------------+-----------------------------------------------------------------
id | integer | not null default nextval('sensors_id_seq'::regclass)
station_id | integer | not null
....
Indexes:
"sensor_primary" PRIMARY KEY, btree (id)
"ind_station_id" btree (station_id, id)
"ind_station" btree (station_id)
Table "public.data"
Column | Type | Modifiers
-----------+--------------------------+------------------------------------------------------------------
id | integer | not null default nextval('data_id_seq'::regclass)
timestamp | timestamp with time zone | not null
sensor_id | integer | not null
avg | integer |
Indexes:
"timestamp_ind" btree ("timestamp" DESC)
"sensor_ind" btree (sensor_id)
"sensor_ind_timestamp" btree (sensor_id, "timestamp")
"sensor_ind_timestamp_desc" btree (sensor_id, "timestamp" DESC)
Note that I just added ind_station_id on sensors, following @Erwin's suggestion below. The timings haven't really changed drastically: still >1200ms in the ORDER BY DESC + LIMIT 1 case and still ~0.9ms in the MAX case.
Query plans:
QUERY PLAN (ORDER BY)
----------------------------------------------------------------------------------------------------------
Limit (cost=0.58..9.62 rows=1 width=8) (actual time=2161.054..2161.054 rows=0 loops=1)
  Buffers: shared hit=3418066 read=47326
  -> Nested Loop (cost=0.58..396382.45 rows=43823 width=8) (actual time=2161.053..2161.053 rows=0 loops=1)
        Join Filter: (data.sensor_id = sensors.id)
        Buffers: shared hit=3418066 read=47326
        -> Index Scan using timestamp_ind on data (cost=0.43..255048.99 rows=4710976 width=12) (actual time=0.047..1410.715 rows=4710976 loops=1)
              Buffers: shared hit=3418065 read=47326
        -> Materialize (cost=0.14..4.19 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=4710976)
              Buffers: shared hit=1
              -> Index Only Scan using ind_station_id on sensors (cost=0.14..4.18 rows=2 width=4) (actual time=0.004..0.004 rows=0 loops=1)
                    Index Cond: (station_id = 4)
                    Heap Fetches: 0
                    Buffers: shared hit=1
Planning time: 0.478 ms
Execution time: 2161.090 ms
(15 rows)
QUERY PLAN (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate (cost=3678.08..3678.09 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=1)
  Buffers: shared hit=1
  -> Nested Loop (cost=0.58..3568.52 rows=43823 width=8) (actual time=0.006..0.006 rows=0 loops=1)
        Buffers: shared hit=1
        -> Index Only Scan using ind_station_id on sensors (cost=0.14..4.18 rows=2 width=4) (actual time=0.005..0.005 rows=0 loops=1)
              Index Cond: (station_id = 4)
              Heap Fetches: 0
              Buffers: shared hit=1
        -> Index Only Scan using sensor_ind_timestamp on data (cost=0.43..1389.59 rows=39258 width=12) (never executed)
              Index Cond: (sensor_id = sensors.id)
              Heap Fetches: 0
Planning time: 0.435 ms
Execution time: 0.048 ms
(13 rows)
So, as explained before, the ORDER BY does a Scan using timestamp_ind on data, which is not done in the MAX case.
Postgres version:
Postgres from the Ubuntu repos: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit
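Note that there are NOT NULL constraints in place, so ORDER BY won't have to sort over NULL rows.
Note also that I'm mainly interested in where the difference comes from. While not ideal, I can retrieve the data relatively quickly using an EXISTS (<1ms) and then a SELECT (~11ms).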
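For illustration, the two-step workaround mentioned above might look roughly like the following. This is only a sketch: the question does not show the exact statements, so the form of the EXISTS check and the has_rows alias are assumptions.

```sql
-- Step 1 (assumed form): cheap existence check for the station.
SELECT EXISTS (
    SELECT 1
    FROM   data
    JOIN   sensors ON sensors.id = data.sensor_id
    WHERE  sensors.station_id = 4
) AS has_rows;

-- Step 2: run the ORDER BY ... LIMIT 1 query only if step 1 returned true,
-- so the slow "no matching rows" case is never reached.
SELECT data."timestamp"
FROM   data
JOIN   sensors ON sensors.id = data.sensor_id
WHERE  sensors.station_id = 4
ORDER  BY data."timestamp" DESC
LIMIT  1;
```

The application has to issue both statements and branch on the first result, which is why this counts as a workaround rather than a fix.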
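The query plan shows index names timestamp_ind and timestamp_sensor_ind. But indexes like that do not help with searching for a particular sensor.
To resolve an equality condition (like sensor.id = data.sensor_id), the column has to be the first one in the index. Try adding an index that allows searching on sensor_id and, within a sensor, is sorted by timestamp: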
create index sensor_timestamp_ind on data(sensor_id, timestamp);
Does adding that index speed up the query?
There does not seem to be an index on sensors.station_id, which is most probably important here.
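There is an actual difference between max() and ORDER BY DESC + LIMIT 1. Many people seem to miss that. NULL values sort first in descending sort order. So ORDER BY timestamp DESC LIMIT 1 returns a row with timestamp IS NULL if one exists, while the aggregate function max() ignores NULL values and returns the latest non-null timestamp.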
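The difference can be seen on a throwaway two-row data set. The sketch below uses made-up inline values (one real timestamp and one NULL) and hypothetical column aliases; it does not touch the tables from the question.

```sql
-- Inline sample data: one real timestamp and one NULL (illustrative values).
WITH t(ts) AS (
    VALUES (timestamptz '2016-03-15 12:00+00'), (NULL)
)
SELECT
    -- NULL: with DESC, Postgres sorts NULLs first by default.
    (SELECT ts FROM t ORDER BY ts DESC LIMIT 1)            AS order_by_desc,
    -- The real timestamp: NULLS LAST pushes the NULL row to the end.
    (SELECT ts FROM t ORDER BY ts DESC NULLS LAST LIMIT 1) AS order_by_nulls_last,
    -- The real timestamp: max() ignores NULL input.
    (SELECT max(ts) FROM t)                                AS max_ts;
```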
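For your case, since your column d.timestamp is defined NOT NULL (as your update revealed), there is no effective difference. An index with DESC NULLS LAST and the same clause in the ORDER BY for the LIMIT query should still serve you best. I suggest these indexes (my query below builds on the second one):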
sensors(station_id, id)
data(sensor_id, timestamp DESC NULLS LAST)
You can drop the other index variants sensor_ind_timestamp and sensor_ind_timestamp_desc unless you have other queries that still need them (unlikely, but possible).
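Spelled out as DDL, the suggestion could look like the sketch below. The new index names are hypothetical, and the DROP statements are left commented out on the assumption that you first check whether other queries depend on those indexes.

```sql
-- sensors(station_id, id): per the table definition in the question, the
-- existing index "ind_station_id" already covers this, so it is not recreated.
-- CREATE INDEX sensors_station_id_id_idx ON sensors (station_id, id);

-- data(sensor_id, timestamp DESC NULLS LAST); the index name is hypothetical.
CREATE INDEX data_sensor_id_timestamp_idx
    ON data (sensor_id, "timestamp" DESC NULLS LAST);

-- Only after confirming that no other query needs them:
-- DROP INDEX sensor_ind_timestamp;
-- DROP INDEX sensor_ind_timestamp_desc;
```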
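Much more importantly, there is another difficulty: the filter on the first table sensors returns few, but still (possibly) multiple rows. Postgres expects to find 2 rows (rows=2) in your added EXPLAIN output.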
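The perfect technique would be a loose index scan on the second table data, which is currently not implemented in Postgres 9.4 (or Postgres 9.5). You can rewrite the query to work around this limitation in various ways. Details: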
- Optimize GROUP BY query to retrieve latest record per user
The best should be:
SELECT d.timestamp
FROM sensors s
CROSS JOIN LATERAL (
SELECT timestamp
FROM data
WHERE sensor_id = s.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) d
WHERE s.station_id = 4
ORDER BY d.timestamp DESC NULLS LAST
LIMIT 1;
Since the style of the outer query is mostly irrelevant, you can also just use:
SELECT max(d.timestamp) AS timestamp
FROM sensors s
CROSS JOIN LATERAL (
SELECT timestamp
FROM data
WHERE sensor_id = s.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) d
WHERE s.station_id = 4;
The max() variant should perform about as fast now:
SELECT max(d.timestamp) AS timestamp
FROM sensors s
CROSS JOIN LATERAL (
SELECT max(timestamp) AS timestamp
FROM data
WHERE sensor_id = s.id
) d
WHERE s.station_id = 4;
Or even, shortest of all:
SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp
FROM sensors s
WHERE station_id = 4;
Note the double parentheses!
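The additional advantage of LIMIT in a LATERAL join is that you can retrieve arbitrary columns of the selected row, not just the latest timestamp (one column).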
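As a sketch of that advantage, the first LATERAL query can be widened to return the whole latest reading. The columns sensor_id and avg come from the table definition in the question; picking these particular columns is just an illustration.

```sql
-- Latest reading for station 4, returning several columns of the chosen row.
SELECT d.sensor_id, d."timestamp", d."avg"
FROM   sensors s
CROSS  JOIN LATERAL (
    -- Per sensor: newest row, served by the (sensor_id, timestamp DESC NULLS LAST) index.
    SELECT sensor_id, "timestamp", "avg"
    FROM   data
    WHERE  sensor_id = s.id
    ORDER  BY "timestamp" DESC NULLS LAST
    LIMIT  1
) d
WHERE  s.station_id = 4
ORDER  BY d."timestamp" DESC NULLS LAST
LIMIT  1;
```

The max() variants cannot do this directly, because an aggregate returns only the value and not the row it came from.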
Related:
- Why do NULL values come first when ordering DESC in a PostgreSQL query?
- Select first row in each GROUP BY group?
- Optimize groupwise maximum query