Trigram Index ORDER BY optimization
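I'm trying to implement a search feature, and after some investigation (see this interesting read by Yorick Peterse at GitLab) I decided to go with the trigram approach using the pg_trgm extension.
I want to return the 10 most relevant rows.
Here are a few queries I tested against a table with 110868 rows (following the doc):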
SELECT name, similarity(name, 'search query') AS sml
FROM table
ORDER BY sml DESC, name;
Time: 701.814 ms
SELECT name, similarity(name, 'search query') AS sml
FROM table
WHERE name % 'search query'
ORDER BY sml DESC, name;
Time: 376.692 ms
SELECT name, similarity(name, 'search query') AS sml
FROM table
WHERE name % 'search query'
ORDER BY sml DESC, name LIMIT 10;
Time: 378.921 ms
With a GiST index:
CREATE INDEX trigram_index ON table USING GIST (name gist_trgm_ops);
SELECT name, similarity(name, 'search query') AS sml
FROM table
WHERE name % 'search query'
ORDER BY sml DESC, name LIMIT 10;
Time: 36.877 ms
With a GIN index:
CREATE INDEX trigram_index ON table USING GIN (name gin_trgm_ops);
SELECT name, similarity(name, 'search query') AS sml
FROM table WHERE name % 'search query'
ORDER BY sml DESC, name LIMIT 10;
Time: 18.992 ms
With EXPLAIN ANALYZE:
Limit (cost=632.37..632.39 rows=10 width=25) (actual time=22.202..22.204 rows=10 loops=1)
  -> Sort (cost=632.37..632.64 rows=111 width=25) (actual time=22.201..22.201 rows=10 loops=1)
       Sort Key: (similarity((name)::text, 'search query'::text)) DESC, name
       Sort Method: top-N heapsort Memory: 26kB
       -> Bitmap Heap Scan on table (cost=208.86..629.97 rows=111 width=25) (actual time=6.900..22.157 rows=134 loops=1)
            Recheck Cond: ((name)::text % 'search query'::text)
            Rows Removed by Index Recheck: 2274
            Heap Blocks: exact=2257
            -> Bitmap Index Scan on trigram_index (cost=0.00..208.83 rows=111 width=0) (actual time=6.532..6.532 rows=2408 loops=1)
                 Index Cond: ((name)::text % 'World of Warcraft'::text)
Planning time: 0.073 ms
Execution time: 18.521 ms
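Using the GIN index improves performance significantly. However, limiting the result to 10 rows does not seem to make any difference.
Is there room for improvement that I haven't considered? I'm especially interested in suggestions that take advantage of the fact that I only need a small part of the whole table.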
As the documentation says, a GIN index does not help optimize the ORDER BY clause:
A variant of the above query is
SELECT t, t <-> 'word' AS dist
FROM test_trgm
ORDER BY dist LIMIT 10;
This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.
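On the other hand, for larger tables a GIN index usually performs better than a GiST index.
So I think you should try both and use whichever is faster in a test against a realistically sized table.
Apart from using more RAM to cache the data, I don't think you can improve much further.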
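To make the comparison concrete, here is a minimal sketch of the distance-ordered form from the documentation, applied to the table in the question. It assumes the pg_trgm extension is installed. The index name trigram_gist_index is hypothetical, and the table name is quoted because TABLE is a reserved word. The secondary sort on name is dropped here as an assumption, since an index scan can only supply the ordering by distance.

```sql
-- Hypothetical index name; GiST (unlike GIN) supports ordering by the <-> operator.
CREATE INDEX trigram_gist_index ON "table" USING GIST (name gist_trgm_ops);

-- name <-> 'search query' is the trigram distance, i.e. 1 - similarity(name, 'search query').
-- Ordering by the distance alone with a LIMIT lets the planner read the
-- nearest rows straight from the GiST index instead of sorting every match.
EXPLAIN ANALYZE
SELECT name, name <-> 'search query' AS dist
FROM "table"
ORDER BY dist
LIMIT 10;
```

Running this and the GIN-backed `%` query from the question under EXPLAIN ANALYZE on realistically sized data should show which plan is faster in practice. Note that this form has no `%` filter, so it always returns 10 rows, even if none of them is a close match.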