Postgres: Why did adding index slow down regexp queries?
I have a TEXT column keyvalues in Postgres:
select * from test5 limit 5;
id | keyvalues
----+------------------------------------------------------
1 | ^ first 1 | second 3
2 | ^ first 1 | second 2 ^ first 2 | second 3
3 | ^ first 1 | second 2 | second 3
4 | ^ first 2 | second 3 ^ first 1 | second 2 | second 2
5 | ^ first 2 | second 3 ^ first 1 | second 3
My queries must exclude the ^ character from the middle of a match, so I'm using regular expressions:
explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=78383.31..78383.32 rows=1 width=8) (actual time=7332.030..7332.030 rows=1 loops=1)
-> Gather (cost=78383.10..78383.30 rows=2 width=8) (actual time=7332.021..7337.138 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=77383.10..77383.10 rows=1 width=8) (actual time=7328.155..7328.156 rows=1 loops=3)
-> Parallel Seq Scan on test5 (cost=0.00..77382.50 rows=238 width=0) (actual time=7328.146..7328.146 rows=0 loops=3)
Filter: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
Rows Removed by Filter: 1666668
Planning Time: 0.068 ms
Execution Time: 7337.184 ms
The query works (zero rows match), but it is far too slow at more than 7 seconds.
I thought indexing with trigrams would help, but no luck:
create extension if not exists pg_trgm;
create index on test5 using gin (keyvalues gin_trgm_ops);
explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1484.02..1484.03 rows=1 width=8) (actual time=23734.646..23734.646 rows=1 loops=1)
-> Bitmap Heap Scan on test5 (cost=1480.00..1484.01 rows=1 width=0) (actual time=23734.641..23734.641 rows=0 loops=1)
Recheck Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
Rows Removed by Index Recheck: 5000005
Heap Blocks: exact=47620
-> Bitmap Index Scan on test5_keyvalues_idx (cost=0.00..1480.00 rows=1 width=0) (actual time=1756.158..1756.158 rows=5000005 loops=1)
Index Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
Planning Time: 0.412 ms
Execution Time: 23734.722 ms
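The query with the trigram index is 3x slower! It still returns the correct result (zero rows). I expected the trigram index to figure out immediately that there's no second 0 string anywhere, and to be super fast.
(Motivation: I want to avoid normalizing keyvalues, so I'm looking to encode the matching logic in a single TEXT field using text indexing and regexps instead. The logic works, but is too slow.)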
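As you can see, this does not work with trigrams. Trigrams don't match across space boundaries, so if all of your data contains the same words, the index will match every row.
This might make things clearer: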
with data as (select * from (values ('^ first 1 | second 3'),
('^ first 1 | second 2 ^ first 2 | second 3'),
('^ first 1 | second 2 | second 3'),
('^ first 2 | second 3 ^ first 1 | second 2 | second 2'),
('^ first 2 | second 3 ^ first 1 | second 3')
) v(keyvalues)
)
select keyvalues, show_trgm(keyvalues) from data;
keyvalues | show_trgm
------------------------------------------------------+-------------------------------------------------------------------------------------------------------
^ first 1 | second 3 | {" 1"," 3"," f"," s"," 1 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
^ first 1 | second 2 ^ first 2 | second 3 | {" 1"," 2"," 3"," f"," s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
^ first 1 | second 2 | second 3 | {" 1"," 2"," 3"," f"," s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
^ first 2 | second 3 ^ first 1 | second 2 | second 2 | {" 1"," 2"," 3"," f"," s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
^ first 2 | second 3 ^ first 1 | second 3 | {" 1"," 2"," 3"," f"," s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
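The same comparison can be turned around: rather than listing what each row contains, list which trigrams of the search text a row lacks. This is only a sketch, and it makes two assumptions. First, pg_trgm is installed. Second, the literal 'second 0' is treated as plain text by show_trgm, which is not the regexp-specific trigram extraction the index scan actually performs.

```sql
-- Trigrams of the plain text 'second 0' that are absent from one sample row.
-- show_trgm() returns text[]; unnest() expands it to one row per trigram.
select t as missing_trigram
from unnest(show_trgm('second 0')) as t
where t <> all (show_trgm('^ first 1 | second 3'));
```

Only the trigrams that belong to the one-character word 0 should come back. Every trigram of second is already present in every row, so those trigrams cannot narrow the search.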
Could you use a partial index to exclude rows that have ^ in the middle?
According to the OP, user @jjanes gave the correct answer on DBA.SE:
I expected the trigram index to figure out immediately there's no second 0
string anywhere
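'second' and '0' are separate words, so the index cannot detect their joint absence. It seems like it could detect the absence of '0' on its own, but this comment from "contrib/pg_trgm/trgm_regexp.c" seems pertinent: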
* Note: Using again the example "foo bar", we will not consider the
* trigram " b", though this trigram would be found by the trigram
* extraction code. Since we will find " ba", it doesn't seem worth
* trying to hack the algorithm to generate the additional trigram.
Since 0 is the last character in the pattern string, there will be no trigram of the form " 0a" either, so it just misses that opportunity.
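One way to confirm that the index itself works, and that the slowdown is specific to this pattern, is to search for text whose trigrams occur nowhere in the table. In this sketch the pattern 'zzz' is a hypothetical placeholder, assumed to be absent from every keyvalues value.

```sql
-- 'zzz' is an assumed placeholder that matches no row in test5.
-- The regexp yields a trigram that is not in the index, so the GIN scan
-- can reject every row instead of passing all of them on for recheck.
explain analyze
select count(*) from test5 where keyvalues ~* 'zzz';
```

If the planner picks the trigram index, the Bitmap Index Scan node should report far fewer rows than the 5000005 seen above, and "Rows Removed by Index Recheck" should shrink accordingly.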
Even if it weren't for this limitation, your approach seems extremely fragile.