Why would LIMIT 2 queries work but LIMIT 1 always times out?
I'm using this public Postgres database of the NEAR protocol: https://github.com/near/near-indexer-for-explorer#shared-public-access
postgres://public_readonly:nearprotocol@mainnet.db.explorer.indexer.near.dev/mainnet_explorer
SELECT "public"."receipts"."receipt_id",
       "public"."receipts"."included_in_block_hash",
       "public"."receipts"."included_in_chunk_hash",
       "public"."receipts"."index_in_chunk",
       "public"."receipts"."included_in_block_timestamp",
       "public"."receipts"."predecessor_account_id",
       "public"."receipts"."receiver_account_id",
       "public"."receipts"."receipt_kind",
       "public"."receipts"."originated_from_transaction_hash"
FROM "public"."receipts"
WHERE ("public"."receipts"."receipt_id") IN
      (SELECT "t0"."receipt_id"
       FROM "public"."receipts" AS "t0"
       INNER JOIN "public"."action_receipts" AS "j0" ON ("j0"."receipt_id") = ("t0"."receipt_id")
       WHERE ("j0"."signer_account_id" = 'ryancwalsh.near'
              AND "t0"."receipt_id" IS NOT NULL))
ORDER BY "public"."receipts"."included_in_block_timestamp" DESC
LIMIT 1
OFFSET 0
always results in:
ERROR: canceling statement due to statement timeout
SQL state: 57014
But if I change it to LIMIT 2, the query runs in under 1 second.
How can that be? Does this mean the database is set up incorrectly, or am I doing something wrong?
P.S. The query here was generated via Prisma. findFirst always times out, so I figured I might need to change it to findMany as a workaround.
Your query can be simplified/optimized:
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM public.receipts r
WHERE EXISTS (
SELECT FROM public.action_receipts j
WHERE j.receipt_id = r.receipt_id
AND j.signer_account_id = 'ryancwalsh.near'
)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
However, that only scratches the surface of your underlying problem.
Like Kirk already commented, Postgres chooses a different query plan for LIMIT 1, obviously not knowing that only 90 rows in table action_receipts have signer_account_id = 'ryancwalsh.near', while both tables involved have more than 220 million rows and are apparently growing steadily.
Changing to LIMIT 2 makes a different query plan look more favorable, hence the observed difference in performance. (So the query planner's general idea is that the filter is very selective, just not close enough for the corner case of LIMIT 1.)
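You can watch the plan flip for yourself. A minimal sketch against the same database, using the simplified query from above: plain EXPLAIN (without ANALYZE) only plans the query and does not execute it, so even the LIMIT 1 variant should not hit the statement timeout:
EXPLAIN
SELECT r.receipt_id
FROM public.receipts r
WHERE EXISTS (
   SELECT FROM public.action_receipts j
   WHERE j.receipt_id = r.receipt_id
   AND j.signer_account_id = 'ryancwalsh.near'
   )
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;  -- then repeat with LIMIT 2 and compare the two plans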
You should have mentioned the cardinalities to set us on the right track.
Knowing that our filter is this selective, we can force a more favorable query plan with a different query:
WITH j AS (
SELECT receipt_id -- is PK!
FROM public.action_receipts
WHERE signer_account_id = 'ryancwalsh.near'
)
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM j
JOIN public.receipts r USING (receipt_id)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
This uses the same query plan for LIMIT 1, and both queries finish in under 2 ms in my tests:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=134904.89..134904.89 rows=1 width=223) (actual time=1.750..1.754 rows=1 loops=1)
CTE j
-> Bitmap Heap Scan on action_receipts (cost=319.46..41564.59 rows=10696 width=44) (actual time=0.058..0.179 rows=90 loops=1)
Recheck Cond: (signer_account_id = 'ryancwalsh.near'::text)
Heap Blocks: exact=73
-> Bitmap Index Scan on action_receipt_signer_account_id_idx (cost=0.00..316.79 rows=10696 width=0) (actual time=0.043..0.043 rows=90 loops=1)
Index Cond: (signer_account_id = 'ryancwalsh.near'::text)
-> Sort (cost=93340.30..93367.04 rows=10696 width=223) (actual time=1.749..1.750 rows=1 loops=1)
Sort Key: r.included_in_block_timestamp DESC
Sort Method: top-N heapsort Memory: 25kB
-> Nested Loop (cost=0.70..93286.82 rows=10696 width=223) (actual time=0.089..1.705 rows=90 loops=1)
-> CTE Scan on j (cost=0.00..213.92 rows=10696 width=32) (actual time=0.060..0.221 rows=90 loops=1)
-> Index Scan using receipts_pkey on receipts r (cost=0.70..8.70 rows=1 width=223) (actual time=0.016..0.016 rows=1 loops=90)
Index Cond: (receipt_id = j.receipt_id)
Planning Time: 0.281 ms
Execution Time: 1.801 ms
The main point is to execute the highly selective query in the CTE first. Then Postgres does not try to walk the index on (included_in_block_timestamp) under the false assumption that it would find a matching row soon enough. (It doesn't.)
The database at hand runs Postgres 11, where a CTE is always an optimization barrier. In Postgres 12 or later, add AS MATERIALIZED to the CTE to guarantee the same effect.
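On Postgres 12 or later, that could look like the sketch below; it is the same CTE query as above with the keyword added (r.* standing in for the full column list):
WITH j AS MATERIALIZED (  -- forces the CTE to remain an optimization barrier (Postgres 12+)
   SELECT receipt_id
   FROM public.action_receipts
   WHERE signer_account_id = 'ryancwalsh.near'
   )
SELECT r.*  -- shorthand here for the same column list as above
FROM j
JOIN public.receipts r USING (receipt_id)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;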
Or you can use the "OFFSET 0 hack" in any version:
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM (
SELECT receipt_id -- is PK!
FROM public.action_receipts
WHERE signer_account_id = 'ryancwalsh.near'
OFFSET 0 -- !
) j
JOIN public.receipts r USING (receipt_id)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
This prevents "inlining" of the subquery, to the same effect. Finishes in < 2 ms.
"Fix" the database?
The proper fix depends on the complete picture. The underlying problem is that Postgres grossly over-estimates the number of qualifying rows in table action_receipts: the MCV list (most common values) cannot keep up with 220 million rows (and growing). Most probably it's not just ANALYZE lagging behind, though it could be that (autovacuum not configured properly? a rookie mistake?); you can rule that out with the diagnostic sketch below. Depending on the actual cardinalities (data distribution) in action_receipts.signer_account_id and on the access patterns, there are various things you could do to "fix" it. Two options are outlined below.
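Before tuning anything, it may be worth checking whether ANALYZE is simply lagging. A minimal diagnostic sketch, using the standard pg_stat_user_tables statistics view:
SELECT relname, n_live_tup, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'action_receipts';
-- Old or NULL timestamps on a 220-million-row table suggest
-- that (auto)vacuum/analyze is not keeping up.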
1. Raise n_distinct and STATISTICS
If most values in action_receipts.signer_account_id are similarly rare (high cardinality), consider setting a very large n_distinct value for that column, and combine it with a moderately increased STATISTICS target for the same column to counteract errors in the other direction (under-estimating the number of qualifying rows for common values).
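A sketch of what that could look like; the concrete numbers here are assumptions to be tuned against the real data distribution, not recommendations:
-- A negative n_distinct is a ratio: -0.05 means "5 % of rows are distinct".
ALTER TABLE public.action_receipts
   ALTER COLUMN signer_account_id SET (n_distinct = -0.05);

-- Moderately raise the per-column statistics target (default is typically 100):
ALTER TABLE public.action_receipts
   ALTER COLUMN signer_account_id SET STATISTICS 500;

-- Changed settings only take effect after the next ANALYZE:
ANALYZE public.action_receipts;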
2. Local fix with a partial index
If action_receipts.signer_account_id = 'ryancwalsh.near' is special in that it is queried more often than others, consider a small partial index for it, to fix just that special case. Like:
CREATE INDEX ON action_receipts (receipt_id)
WHERE signer_account_id = 'ryancwalsh.near';
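A note on usage: the planner only considers a partial index when the query's WHERE clause repeats the index predicate (or something it can prove implies it). A quick sketch to confirm the index is picked up:
EXPLAIN
SELECT receipt_id
FROM public.action_receipts
WHERE signer_account_id = 'ryancwalsh.near';
-- The plan should show a scan of the tiny partial index
-- instead of touching the 220-million-row table at large.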