How do I fetch similar posts from a database where the text length can exceed 2000 characters?

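As far as I'm aware, there is no simple, quick solution. I am trying to do full-text keyword or semantic search, which is a very advanced topic. There are dedicated search servers built specifically for this, but is there still a way to get query execution time under one second?

Here is what I have tried so far: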

begin;

SET pg_trgm.similarity_threshold = 0.3;

select
    id, <col_name>,
    similarity(<indexed_col>, '<text to be searched>') as sml
from
    <table> p
where
    <clauses>
    and <indexed_col> % '<text to be searched>'
    and <indexed_col> <-> '<text to be searched>' < 0.5
order by
    <indexed_col> <-> '<text to be searched>'
limit 10;

end;

The index was created as follows: CREATE INDEX trgm_idx ON posts USING gin (post_title_combined gin_trgm_ops);

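The query above takes about 6-7 seconds to execute, but sometimes only 200 milliseconds, which seems strange to me. The query plan changes depending on the search text I pass in.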

I also tried tsvector @@ tsquery, but the matches become too strict because of the & operator.

EDIT: Here is the EXPLAIN ANALYZE output for the query above:

  ->  Sort  (cost=463.82..463.84 rows=5 width=321) (actual time=3778.726..3778.728 rows=0 loops=1)
        Sort Key: ((post_title_combined <-> 'Test text not to be disclosed'::text))
        Sort Method: quicksort  Memory: 25kB
        ->  Bitmap Heap Scan on posts p  (cost=404.11..463.77 rows=5 width=321) (actual time=3778.722..3778.723 rows=0 loops=1)
              Recheck Cond: (post_title_combined % 'Test text not to be disclosed'::text)
              Rows Removed by Index Recheck: 36258
              Filter: ((content IS NOT NULL) AND (is_crawlable IS TRUE) AND (score IS NOT NULL) AND (status = 1) AND ((post_title_combined <-> 'Test text not to be disclosed'::text) < '0.5'::double precision))
              Heap Blocks: exact=24043
              ->  Bitmap Index Scan on trgm_idx  (cost=0.00..404.11 rows=15 width=0) (actual time=187.394..187.394 rows=36916 loops=1)
                    Index Cond: (post_title_combined % 'Test text not to be disclosed'::text)
Planning Time: 8.782 ms
Execution Time: 3778.787 ms

Your redundant/overlapping query conditions are not helping. Setting similarity_threshold=0.3 and then doing

t % q and t <-> q < 0.5 

just throws away index selectivity for no reason. Set similarity_threshold to the most stringent value you want to use, then get rid of the unnecessary <-> condition.
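
As a rough sketch of that advice: the <-> operator is defined as 1 - similarity, so the old "<-> ... < 0.5" filter was effectively asking for a similarity above 0.5. Assuming 0.5 is really the cutoff you want, the query could look something like the following. The table name, column names, and filter clauses are taken from the plan in the question, and the search text is a placeholder.

```sql
BEGIN;

-- SET LOCAL scopes the setting to this transaction only.
-- 0.5 is an assumption: it mirrors the old "<-> ... < 0.5" filter,
-- because distance (<->) equals 1 - similarity.
SET LOCAL pg_trgm.similarity_threshold = 0.5;

SELECT
    id,
    similarity(post_title_combined, '<text to be searched>') AS sml
FROM posts p
WHERE
    content IS NOT NULL
    AND is_crawlable IS TRUE
    AND score IS NOT NULL
    AND status = 1
    -- The % operator alone now applies the cutoff and can use trgm_idx.
    AND post_title_combined % '<text to be searched>'
ORDER BY sml DESC
LIMIT 10;

COMMIT;
```

With the stricter threshold applied inside the index condition, the bitmap index scan should hand far fewer candidate rows to the heap recheck, which is where most of the 3.7 seconds in the posted plan is spent.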

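You could try the GiST version of the trigram index. It can support the ORDER BY ... <-> ... LIMIT 10 operation directly from the index. I doubt it will be very effective with 2000-character strings, but it is worth a try.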
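
A minimal sketch of the GiST variant, assuming the same posts table and post_title_combined column as in the question. The index name trgm_gist_idx is hypothetical, and the search text is a placeholder.

```sql
-- gist_trgm_ops is the GiST operator class shipped with pg_trgm.
-- trgm_gist_idx is a made-up name for illustration.
CREATE INDEX trgm_gist_idx
    ON posts USING gist (post_title_combined gist_trgm_ops);

-- Nearest-neighbour search: with a GiST trigram index the planner can
-- walk the index in distance order and stop after 10 rows,
-- instead of collecting every match and sorting afterwards.
SELECT
    id,
    post_title_combined <-> '<text to be searched>' AS dist
FROM posts
ORDER BY post_title_combined <-> '<text to be searched>'
LIMIT 10;
```

On PostgreSQL 13 or later, gist_trgm_ops also accepts a siglen parameter (signature length in bytes, default 12). A larger value may make the index more precise for long strings at the cost of a bigger index, though whether it helps for 2000-character text is something you would have to measure with EXPLAIN ANALYZE.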