Optimizing a psql query that uses regex on text search
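I have the following query in psql.
Any suggestions on how to speed it up or optimize it?
I have tried various indexes on title and headline, but none of them are being used.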
SELECT "people".*
FROM "people"
WHERE ((title IS NOT NULL AND title ~* '(^| )(one|two|three)( |$|,)' AND title !~* '(^| )(four|five|six)( |$|,)')
    OR (title IS NULL AND headline ~* '(^| )(one|two|three)( |$|,)' AND headline !~* '(^| )(four|five|six)( |$|,)'))
  AND ((title IS NOT NULL AND title ~* '(^| )(seven|eight|nine)( |$|,)' AND title !~* '(^| )(ten|eleven)( |$|,)')
    OR (title IS NULL AND headline ~* '(^| )(seven|eight|nine)( |$|,)' AND headline !~* '(^| )(ten|eleven)( |$|,)'))
Here is the EXPLAIN output:
Gather (cost=1000.00..286343.58 rows=61760 width=715)
Workers Planned: 2
-> Parallel Seq Scan on people (cost=0.00..279167.58 rows=25733 width=715)
Filter: ((((title IS NOT NULL) AND ((title)::text ~* '(^| )(one|two|three)( |$|,)'::text) AND ((title)::text !~* '(^| )(four|five|six)( |$|,)'::text)) OR ((title IS NULL) AND ((headline)::text ~* '(^| )(one|two|three)( |$|,)'::text) AND ((headline)::text !~* '(^| )(four|five|six)( |$|,)'::text))) AND (((title IS NOT NULL) AND ((title)::text ~* '(^| )(seven|eight|nine)( |$|,)'::text) AND ((title)::text !~* '(^| )(ten|eleven)( |$|,)'::text)) OR ((title IS NULL) AND ((headline)::text ~* '(^| )(seven|eight|nine)( |$|,)'::text) AND ((headline)::text !~* '(^| )(ten|eleven)( |$|,)'::text))))
JIT:
Functions: 2
Options: Inlining false, Optimization false, Expressions true, Deforming true
(7 rows)
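Traditional relational databases generally won't use an ordinary (B-tree) index on a column unless the condition specifies the leading part of the column's value, i.e.: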
... where my_column like 'FOO%' -- will (usually) use index
... where my_column like '%FOO%' -- will (usually) not use index
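For the prefix case in PostgreSQL specifically, a rough sketch of what such an index could look like is shown here. The table name my_table and the index name are hypothetical, and my_column is assumed to be of type text. In a database that does not use the C locale, a plain B-tree index typically cannot serve LIKE 'FOO%', which is why the text_pattern_ops operator class appears in the sketch.

```sql
-- Hypothetical names: my_table, my_table_my_column_prefix_idx.
-- Assumption: my_column is of type text (use varchar_pattern_ops for varchar).
create index my_table_my_column_prefix_idx
  on my_table (my_column text_pattern_ops);

-- Anchored pattern: the planner can turn this into a range scan on the index.
explain select * from my_table where my_column like 'FOO%';

-- Unanchored pattern: there is no leading part to range over,
-- so this B-tree index is not expected to be used.
explain select * from my_table where my_column like '%FOO%';
```

Whether the planner actually picks the index still depends on table size and statistics, so checking the EXPLAIN output on real data is the safer test.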
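To search efficiently for terms within content, you need a text-based search technique.
Fortunately, Postgres provides support for full text search, which should give you good performance and a convenient syntax for your task.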
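As a minimal sketch of how the query from the question might look with full text search: the index name people_fts_idx is made up, the 'simple' text search configuration (no stemming, no stop words) is an assumption, and coalesce(title, headline) is used to mirror the "fall back to headline only when title is NULL" logic. Note that the full text parser splits words on more than spaces and commas, so the matches will not be identical to those of the original regular expressions.

```sql
-- Hypothetical index name; 'simple' is an assumed configuration choice.
-- The indexed expression must be repeated verbatim in the query for the index to be usable.
create index people_fts_idx on people
  using gin (to_tsvector('simple', coalesce(title, headline)));

-- | is OR, & is AND, ! is NOT inside a tsquery.
select *
from people
where to_tsvector('simple', coalesce(title, headline))
      @@ to_tsquery('simple',
           '(one | two | three) & !(four | five | six) & (seven | eight | nine) & !(ten | eleven)');
```

If exact equivalence with the regex boundaries `(^| )` and `( |$|,)` matters, it is worth comparing the results of both queries on a sample of rows before switching.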
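If I create a trigram index on the right columns, it supports this query directly.
To my surprise, it is actually quite effective, too. That is not always a given: some regular expressions cannot be decomposed into an effective set of trigrams.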
create extension pg_trgm;
create index on people using gin (title gin_trgm_ops, headline gin_trgm_ops);
For a million-row table, this gives the following sub-millisecond plan:
Bitmap Heap Scan on people (cost=547.25..551.28 rows=1 width=12) (actual time=0.741..0.743 rows=1 loops=1)
Recheck Cond: (((title ~* '(^| )(one|two|three)( |$|,)'::text) OR (headline ~* '(^| )(one|two|three)( |$|,)'::text)) AND ((title ~* '(^| )(seven|eight|nine)( |$|,)'::text) OR (headline ~* '(^| )(seven|eight|nine)( |$|,)'::text)))
Filter: ((((title IS NOT NULL) AND (title ~* '(^| )(one|two|three)( |$|,)'::text) AND (title !~* '(^| )(four|five|six)( |$|,)'::text)) OR ((title IS NULL) AND (headline ~* '(^| )(one|two|three)( |$|,)'::text) AND (headline !~* '(^| )(four|five|six)( |$|,)'::text))) AND (((title IS NOT NULL) AND (title ~* '(^| )(seven|eight|nine)( |$|,)'::text) AND (title !~* '(^| )(ten|eleven)( |$|,)'::text)) OR ((title IS NULL) AND (headline ~* '(^| )(seven|eight|nine)( |$|,)'::text) AND (headline !~* '(^| )(ten|eleven)( |$|,)'::text))))
Rows Removed by Filter: 2
Heap Blocks: exact=1
-> BitmapAnd (cost=547.25..547.25 rows=1 width=0) (actual time=0.701..0.702 rows=0 loops=1)
-> BitmapOr (cost=241.50..241.50 rows=200 width=0) (actual time=0.395..0.395 rows=0 loops=1)
-> Bitmap Index Scan on people_title_headline_idx (cost=0.00..120.75 rows=100 width=0) (actual time=0.208..0.208 rows=80 loops=1)
Index Cond: (title ~* '(^| )(one|two|three)( |$|,)'::text)
-> Bitmap Index Scan on people_title_headline_idx (cost=0.00..120.75 rows=100 width=0) (actual time=0.186..0.186 rows=60 loops=1)
Index Cond: (headline ~* '(^| )(one|two|three)( |$|,)'::text)
-> BitmapOr (cost=305.50..305.50 rows=200 width=0) (actual time=0.301..0.301 rows=0 loops=1)
-> Bitmap Index Scan on people_title_headline_idx (cost=0.00..152.75 rows=100 width=0) (actual time=0.145..0.145 rows=3 loops=1)
Index Cond: (title ~* '(^| )(seven|eight|nine)( |$|,)'::text)
-> Bitmap Index Scan on people_title_headline_idx (cost=0.00..152.75 rows=100 width=0) (actual time=0.156..0.156 rows=2 loops=1)
Index Cond: (headline ~* '(^| )(seven|eight|nine)( |$|,)'::text)
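Without the index it takes 500 ms.
However, if every row matches the positive expressions and is only excluded afterwards by matching a negative expression (!~*), then the index will not help.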
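One possible variation, offered only as an untested sketch: because the query always looks at title when it is not NULL and at headline otherwise, the four branches can be collapsed into conditions on coalesce(title, headline), with a single trigram index on that expression. The index name below is made up, and pg_trgm is assumed to be installed as above. The negative matches (!~*) are still applied as a filter after the index lookup, so the caveat about rows excluded only by the negative expressions applies here as well.

```sql
-- Hypothetical index name; assumes "create extension pg_trgm" has already been run.
create index people_title_or_headline_trgm_idx on people
  using gin ((coalesce(title, headline)) gin_trgm_ops);

-- Same logic as the original WHERE clause, written against the indexed expression.
select *
from people
where coalesce(title, headline) ~*  '(^| )(one|two|three)( |$|,)'
  and coalesce(title, headline) !~* '(^| )(four|five|six)( |$|,)'
  and coalesce(title, headline) ~*  '(^| )(seven|eight|nine)( |$|,)'
  and coalesce(title, headline) !~* '(^| )(ten|eleven)( |$|,)';
```

Running this under explain (analyze) and comparing it with the two-column index would show whether the planner picks it up and whether it is any faster on the real data.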