How do I fetch similar posts from a database where the text length can exceed 2000 characters?
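As far as I know, there is no simple, fast solution. I am trying to do full-text keyword or semantic search, which is a fairly advanced topic. There are dedicated search servers built for this, but is there still a way to get query execution time under one second?
Here is what I have tried so far: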

    BEGIN;
    SET pg_trgm.similarity_threshold = 0.3;
    SELECT
        id,
        <col_name>,
        similarity(<indexed_col>, '<text to be searched>') AS sml
    FROM
        <table> p
    WHERE
        <clauses>
        AND <indexed_col> % '<text to be searched>'
        AND <indexed_col> <-> '<text to be searched>' < 0.5
    ORDER BY
        <indexed_col> <-> '<text to be searched>'
    LIMIT 10;
    END;

The index was created as follows:

    CREATE INDEX trgm_idx ON posts USING gin (post_title_combined gin_trgm_ops);

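The query above takes about 6-7 seconds to execute, and sometimes only 200 ms, which seems odd to me; the query plan changes depending on the similarity input I pass in.
I also tried ts_vector @@ ts_query, but the matches become too strict because of the & operator.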
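For context, the strictness described here comes from `plainto_tsquery` joining every term with `&`, so a row must contain all of them. One possible relaxation, sketched below, is to rewrite the `&` operators to `|` and rank the matches with `ts_rank`. The `'english'` configuration, the placeholder search text, and the on-the-fly `to_tsvector` call are assumptions made for illustration; the table and column names are the ones that appear in the plan output in this post.

```sql
-- Sketch only: 'english' and the search text are assumed for illustration.
-- plainto_tsquery ANDs all terms; rewriting & to | makes any single term sufficient.
SELECT p.id,
       ts_rank(to_tsvector('english', p.post_title_combined), q.query) AS rank_score
FROM posts p,
     (SELECT replace(plainto_tsquery('english', 'text to be searched')::text,
                     '&', '|')::tsquery AS query) AS q
WHERE to_tsvector('english', p.post_title_combined) @@ q.query
ORDER BY rank_score DESC
LIMIT 10;
```

Without a GIN index on the same `to_tsvector` expression (or on a stored tsvector column) this would likely fall back to a sequential scan, and an OR query over common words can match a large share of the table, so it is not guaranteed to be faster than the trigram approach.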
Edit: here is the EXPLAIN ANALYZE output for the trigram query above:
```
->  Sort  (cost=463.82..463.84 rows=5 width=321) (actual time=3778.726..3778.728 rows=0 loops=1)
      Sort Key: ((post_title_combined <-> 'Test text not to be disclosed'::text))
      Sort Method: quicksort  Memory: 25kB
      ->  Bitmap Heap Scan on posts p  (cost=404.11..463.77 rows=5 width=321) (actual time=3778.722..3778.723 rows=0 loops=1)
            Recheck Cond: (post_title_combined % 'Test text not to be disclosed'::text)
            Rows Removed by Index Recheck: 36258
            Filter: ((content IS NOT NULL) AND (is_crawlable IS TRUE) AND (score IS NOT NULL) AND (status = 1) AND ((post_title_combined <-> 'Test text not to be disclosed'::text) < '0.5'::double precision))
            Heap Blocks: exact=24043
            ->  Bitmap Index Scan on trgm_idx  (cost=0.00..404.11 rows=15 width=0) (actual time=187.394..187.394 rows=36916 loops=1)
                  Index Cond: (post_title_combined % 'Test text not to be disclosed'::text)
Planning Time: 8.782 ms
Execution Time: 3778.787 ms
```
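Your redundant, overlapping query conditions are not helping. Setting similarity_threshold = 0.3 and then running `t % q and t <-> q < 0.5` just throws away index selectivity for no reason: the index has to return every row above the loose 0.3 threshold, and the stricter `<->` condition is only applied afterwards as a filter on the heap rows. Set similarity_threshold to the strictest value you actually want to use, then get rid of the unnecessary `<->` condition.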
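The advice above could look roughly like the following sketch. It assumes a threshold of 0.5, chosen because `t <-> q < 0.5` is nearly the same condition as a similarity of at least 0.5; the other filters are copied from the Filter line of the plan in the question, and the search string is a placeholder.

```sql
BEGIN;
-- SET LOCAL scopes the stricter threshold to this transaction only.
SET LOCAL pg_trgm.similarity_threshold = 0.5;

SELECT p.id,
       similarity(p.post_title_combined, 'text to be searched') AS sml
FROM posts p
WHERE p.content IS NOT NULL
  AND p.is_crawlable IS TRUE
  AND p.score IS NOT NULL
  AND p.status = 1
  -- The only similarity condition: the GIN index applies the 0.5 threshold directly.
  AND p.post_title_combined % 'text to be searched'
ORDER BY sml DESC
LIMIT 10;
COMMIT;
```

With a single `%` condition the index scan uses the stricter threshold directly, so it should return fewer candidate rows to recheck. How much this helps depends on the data and on the search text.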
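You could also try the GiST version of the trigram index. It can support an ORDER BY ... <-> ... LIMIT 10 operation directly from the index. I doubt it will be very efficient for 2000-character strings, but it is worth a try.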
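A minimal sketch of the GiST variant follows. The index name `trgm_gist_idx` is hypothetical and the search string is a placeholder; the table and column names are taken from the question.

```sql
-- pg_trgm must be installed (it already is if the GIN trigram index exists).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- trgm_gist_idx is a hypothetical name for the new index.
CREATE INDEX trgm_gist_idx ON posts USING gist (post_title_combined gist_trgm_ops);

-- Nearest-neighbour form: ordering by <-> with a LIMIT lets the GiST index
-- return rows in distance order instead of sorting every match.
SELECT p.id,
       p.post_title_combined <-> 'text to be searched' AS dist
FROM posts p
ORDER BY p.post_title_combined <-> 'text to be searched'
LIMIT 10;
```

On PostgreSQL 13 and later, `gist_trgm_ops` also accepts a `siglen` parameter for the signature length. A larger value may help with long strings at the cost of a bigger index, though whether it pays off here would need to be measured.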