慢计数终于解决了
Slow counts finally solved
tl;dr:|> Repo.aggregate(:count, :id)
很慢,请使用 |> Repo.aggregate(:count)
我是 运行 拥有超过 500 万集的播客数据库。存储新剧集后,我会像这样计算给定播客的剧集以用于计数器缓存:
episodes_count = where(Episode, podcast_id: ^podcast_id)
|> Repo.aggregate(:count, :id)
事实证明这变得越来越慢。所以我开始深入挖掘,然后我意识到,
在 Postgres 12 中只有 SELECT COUNT(*)
只进行索引扫描,而 SELECT COUNT(e0.id)
不会。
对于冷数据库(刚刚重新启动),即使第一次索引扫描也相当快:
postgres=# \c pan_prod
Sie sind jetzt verbunden mit der Datenbank »pan_prod« als Benutzer »postgres«.
pan_prod=# EXPLAIN ANALYZE SELECT count(*) FROM "episodes" AS e0 WHERE (e0."podcast_id" = 35202);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=348.51..348.52 rows=1 width=8) (actual time=15.823..15.823 rows=1 loops=1)
-> Index Only Scan using episodes_podcast_id_index on episodes e0 (cost=0.43..323.00 rows=10204 width=0) (actual time=1.331..14.832 rows=10613 loops=1)
Index Cond: (podcast_id = 35202)
Heap Fetches: 0
Planning Time: 2.994 ms
Execution Time: 16.017 ms
第二次扫描速度甚至更快:
pan_prod=# EXPLAIN ANALYZE SELECT count(*) FROM "episodes" AS e0 WHERE (e0."podcast_id" = 35202);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=348.51..348.52 rows=1 width=8) (actual time=5.007..5.008 rows=1 loops=1)
-> Index Only Scan using episodes_podcast_id_index on episodes e0 (cost=0.43..323.00 rows=10204 width=0) (actual time=0.042..3.548 rows=10613 loops=1)
Index Cond: (podcast_id = 35202)
Heap Fetches: 0
Planning Time: 0.304 ms
Execution Time: 5.074 ms
虽然第一个位图堆扫描非常慢:
pan_prod=# EXPLAIN ANALYZE SELECT count(e0.id) FROM "episodes" AS e0 WHERE (e0."podcast_id" = 35202);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=37181.71..37181.72 rows=1 width=8) (actual time=4098.525..4098.526 rows=1 loops=1)
-> Bitmap Heap Scan on episodes e0 (cost=219.51..37156.20 rows=10204 width=4) (actual time=6.508..4082.558 rows=10613 loops=1)
Recheck Cond: (podcast_id = 35202)
Heap Blocks: exact=6516
-> Bitmap Index Scan on episodes_podcast_id_index (cost=0.00..216.96 rows=10204 width=0) (actual time=3.657..3.658 rows=10613 loops=1)
Index Cond: (podcast_id = 35202)
Planning Time: 0.412 ms
Execution Time: 4098.719 ms
第二个通常更快:
pan_prod=# EXPLAIN ANALYZE SELECT count(e0.id) FROM "episodes" AS e0 WHERE (e0."podcast_id" = 35202);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=37181.71..37181.72 rows=1 width=8) (actual time=18.857..18.857 rows=1 loops=1)
-> Bitmap Heap Scan on episodes e0 (cost=219.51..37156.20 rows=10204 width=4) (actual time=6.047..17.152 rows=10613 loops=1)
Recheck Cond: (podcast_id = 35202)
Heap Blocks: exact=6516
-> Bitmap Index Scan on episodes_podcast_id_index (cost=0.00..216.96 rows=10204 width=0) (actual time=3.738..3.738 rows=10613 loops=1)
Index Cond: (podcast_id = 35202)
Planning Time: 0.322 ms
Execution Time: 18.999 ms
我不明白,为什么SELECT count(e0.id)
不使用索引,我想知道为什么。
我一直想,我应该更喜欢它,因为只看了一个栏目,但似乎不是这样。
I don't get, why SELECT count(e0.id) doesn't use the index, and I would like to know why. I always thought, I should prefer it, as only one column is looked at, but that is not the case, it seems.
它确实两次使用索引episodes_podcast_id_index
。
一旦它可以不用额外查看 table,其他时候它会查看 table。
您没有提供索引定义,但 id
似乎不在其中。我还认为它不是 not null
可用列(它可能包含 null
值)。因此,数据库必须获取 id
列以查看它是否为 null
.
为什么?
因为 count(*)
计算行数,而 count(<expression>)
计算非空值。
tl;dr:|> Repo.aggregate(:count, :id)
很慢,请使用 |> Repo.aggregate(:count)
我是 运行 拥有超过 500 万集的播客数据库。存储新剧集后,我会像这样计算给定播客的剧集以用于计数器缓存:
episodes_count = where(Episode, podcast_id: ^podcast_id)
|> Repo.aggregate(:count, :id)
事实证明这变得越来越慢。所以我开始深入挖掘,然后我意识到,
在 Postgres 12 中只有 SELECT COUNT(*)
只进行索引扫描,而 SELECT COUNT(e0.id)
不会。
对于冷数据库(刚刚重新启动),即使第一次索引扫描也相当快:
postgres=# \c pan_prod
Sie sind jetzt verbunden mit der Datenbank »pan_prod« als Benutzer »postgres«.
pan_prod=# EXPLAIN ANALYZE SELECT count(*) FROM "episodes" AS e0 WHERE (e0."podcast_id" = 35202);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=348.51..348.52 rows=1 width=8) (actual time=15.823..15.823 rows=1 loops=1)
-> Index Only Scan using episodes_podcast_id_index on episodes e0 (cost=0.43..323.00 rows=10204 width=0) (actual time=1.331..14.832 rows=10613 loops=1)
Index Cond: (podcast_id = 35202)
Heap Fetches: 0
Planning Time: 2.994 ms
Execution Time: 16.017 ms
第二次扫描速度甚至更快:
pan_prod=# EXPLAIN ANALYZE SELECT count(*) FROM "episodes" AS e0 WHERE (e0."podcast_id" = 35202);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=348.51..348.52 rows=1 width=8) (actual time=5.007..5.008 rows=1 loops=1)
-> Index Only Scan using episodes_podcast_id_index on episodes e0 (cost=0.43..323.00 rows=10204 width=0) (actual time=0.042..3.548 rows=10613 loops=1)
Index Cond: (podcast_id = 35202)
Heap Fetches: 0
Planning Time: 0.304 ms
Execution Time: 5.074 ms
虽然第一个位图堆扫描非常慢:
pan_prod=# EXPLAIN ANALYZE SELECT count(e0.id) FROM "episodes" AS e0 WHERE (e0."podcast_id" = 35202);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=37181.71..37181.72 rows=1 width=8) (actual time=4098.525..4098.526 rows=1 loops=1)
-> Bitmap Heap Scan on episodes e0 (cost=219.51..37156.20 rows=10204 width=4) (actual time=6.508..4082.558 rows=10613 loops=1)
Recheck Cond: (podcast_id = 35202)
Heap Blocks: exact=6516
-> Bitmap Index Scan on episodes_podcast_id_index (cost=0.00..216.96 rows=10204 width=0) (actual time=3.657..3.658 rows=10613 loops=1)
Index Cond: (podcast_id = 35202)
Planning Time: 0.412 ms
Execution Time: 4098.719 ms
第二个通常更快:
pan_prod=# EXPLAIN ANALYZE SELECT count(e0.id) FROM "episodes" AS e0 WHERE (e0."podcast_id" = 35202);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=37181.71..37181.72 rows=1 width=8) (actual time=18.857..18.857 rows=1 loops=1)
-> Bitmap Heap Scan on episodes e0 (cost=219.51..37156.20 rows=10204 width=4) (actual time=6.047..17.152 rows=10613 loops=1)
Recheck Cond: (podcast_id = 35202)
Heap Blocks: exact=6516
-> Bitmap Index Scan on episodes_podcast_id_index (cost=0.00..216.96 rows=10204 width=0) (actual time=3.738..3.738 rows=10613 loops=1)
Index Cond: (podcast_id = 35202)
Planning Time: 0.322 ms
Execution Time: 18.999 ms
我不明白,为什么SELECT count(e0.id)
不使用索引,我想知道为什么。
我一直想,我应该更喜欢它,因为只看了一个栏目,但似乎不是这样。
I don't get, why SELECT count(e0.id) doesn't use the index, and I would like to know why. I always thought, I should prefer it, as only one column is looked at, but that is not the case, it seems.
它确实两次使用索引episodes_podcast_id_index
。
一旦它可以不用额外查看 table,其他时候它会查看 table。
您没有提供索引定义,但 id
似乎不在其中。我还认为它不是 not null
可用列(它可能包含 null
值)。因此,数据库必须获取 id
列以查看它是否为 null
.
为什么?
因为 count(*)
计算行数,而 count(<expression>)
计算非空值。