Text searching in Django with trigram

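I am trying to speed up search results in my application, but whatever method I use, I always get the same result. Since it is a Django application, I'll provide both the ORM commands and the generated SQL (PostgreSQL is used).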

First, I enabled GIN indexes and trigram operations on the database:
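The snippet for this step is not reproduced here. As a rough sketch only, extensions like these are commonly enabled from a Django migration; the `reviews` app label comes from the SQL shown later, while the dependency name below is a hypothetical placeholder:

```python
from django.contrib.postgres.operations import BtreeGinExtension, TrigramExtension
from django.db import migrations


class Migration(migrations.Migration):
    # Hypothetical dependency; point this at the app's real previous migration.
    dependencies = [("reviews", "0001_initial")]

    operations = [
        BtreeGinExtension(),  # runs CREATE EXTENSION IF NOT EXISTS for btree_gin
        TrigramExtension(),   # runs CREATE EXTENSION IF NOT EXISTS for pg_trgm
    ]
```

Creating extensions usually requires a database role with sufficient privileges, so this may need to be run by a superuser.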

Second, I created a table with 2 varchar columns: first_name and last_name (plus an id field as the primary key).

from django.db import models


class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

I also filled the database with 952 sample records so that Postgres would not avoid the index just because the dataset was too small.

Next, I ran the queries on the non-indexed data.

A simple LIKE query:

In [50]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)

In [51]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..24.28 rows=38 width=16) (actual time=0.011..0.242 rows=56 loops=1)
  Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
  Rows Removed by Filter: 896
Planning Time: 0.042 ms
Execution Time: 0.249 ms

Trigram similar:

In [55]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari

In [56]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..21.90 rows=1 width=16) (actual time=0.582..0.582 rows=0 loops=1)
  Filter: ((last_name)::text % 'ari'::text)
  Rows Removed by Filter: 952
Planning Time: 0.033 ms
Execution Time: 0.591 ms
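A side note on the zero rows returned here: the `%` operator only matches when the trigram similarity exceeds a threshold (0.3 by default in pg_trgm), and a three-letter search term shares few trigrams with a longer name. The following is a rough pure-Python approximation of how pg_trgm scores strings, not the extension's actual code. It assumes ASCII names, and the sample names are made up for illustration:

```python
import re

# pg_trgm's default similarity threshold for the % operator.
DEFAULT_THRESHOLD = 0.3


def trigrams(text):
    """Approximate pg_trgm: lowercase, split into alphanumeric words,
    pad each word with two leading spaces and one trailing space,
    then collect the distinct 3-character windows."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared distinct trigrams divided by all distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # Hypothetical last names, not taken from the real table.
    for name in ["Marino", "Aria", "Ari"]:
        score = similarity("ari", name)
        print(name, round(score, 2), score > DEFAULT_THRESHOLD)
```

Under this approximation "Marino" shares only the trigram `ari` with the search term and scores 0.1, well under the threshold, even though `icontains` would match it. That would also explain why filtering on `similarity > 0` returns more rows than the LIKE query: any name sharing a single trigram, such as one merely starting with "Ar", gets a non-zero score.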

And a fancier query with sorted results:

In [58]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC

In [59]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort  (cost=38.24..39.03 rows=317 width=20) (actual time=0.680..0.683 rows=84 loops=1)
  Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
  Sort Method: quicksort  Memory: 31kB
  ->  Seq Scan on reviews_author  (cost=0.00..25.07 rows=317 width=20) (actual time=0.021..0.657 rows=84 loops=1)
        Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
        Rows Removed by Filter: 868
Planning Time: 0.062 ms
Execution Time: 0.693 ms

The next step was to create the index:

from django.contrib.postgres.indexes import GinIndex
from django.db import models


class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

    class Meta:
        indexes = [GinIndex(fields=['last_name'])]

This resulted in the following SQL migration:

./manage.py sqlmigrate reviews 0004
BEGIN;
--
-- Alter field score on review
--
--
-- Create index reviews_aut_last_na_a89a84_gin on field(s) last_name of model author
--
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");
COMMIT;

Now I run the same commands.

LIKE:

In [60]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)

In [61]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..24.28 rows=38 width=16) (actual time=0.009..0.237 rows=56 loops=1)
  Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
  Rows Removed by Filter: 896
Planning Time: 0.089 ms
Execution Time: 0.244 ms

Trigram similar:

In [62]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari

In [63]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..21.90 rows=1 width=16) (actual time=0.740..0.740 rows=0 loops=1)
  Filter: ((last_name)::text % 'ari'::text)
  Rows Removed by Filter: 952
Planning Time: 0.056 ms
Execution Time: 0.750 ms

The more complex query:

In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC

In [65]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort  (cost=38.24..39.03 rows=317 width=20) (actual time=0.659..0.662 rows=84 loops=1)
  Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
  Sort Method: quicksort  Memory: 31kB
  ->  Seq Scan on reviews_author  (cost=0.00..25.07 rows=317 width=20) (actual time=0.024..0.643 rows=84 loops=1)
        Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
        Rows Removed by Filter: 868
Planning Time: 0.052 ms
Execution Time: 0.674 ms

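The change in execution time seems negligible. For the last query, the scan takes 0.643 ms, compared with 0.657 ms in the previous case. The total times also differ by only about 0.02 ms (and the second query even ran a little slower). Is there some option I am missing that should be enabled to help performance? Is the dataset too simple?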

Documentation I used:

EDIT: I added more records (there are now almost 259,000) and ran the tests again. First, without the index:

In [59]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.018..58.630 rows=846 loops=1)
  Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
  Rows Removed by Filter: 258106
Planning Time: 0.046 ms
Execution Time: 58.662 ms

In [60]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather  (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.555..80.710 rows=698 loops=1)
  Workers Planned: 1
  Workers Launched: 1
  ->  Parallel Seq Scan on reviews_author  (cost=0.00..3453.06 rows=152 width=16) (actual time=0.503..78.743 rows=349 loops=2)
        Filter: ((last_name)::text % 'bar'::text)
        Rows Removed by Filter: 129127
Planning Time: 0.039 ms
Execution Time: 80.740 ms

In [61]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort  (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.214..168.876 rows=14235 loops=1)
  Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
  Sort Method: quicksort  Memory: 1485kB
  ->  Seq Scan on reviews_author  (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.022..165.806 rows=14235 loops=1)
        Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
        Rows Removed by Filter: 244717
Planning Time: 0.052 ms
Execution Time: 169.319 ms

And with the index:

In [62]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.015..59.366 rows=846 loops=1)
  Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
  Rows Removed by Filter: 258106
Planning Time: 0.072 ms
Execution Time: 59.395 ms

In [63]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather  (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.545..80.337 rows=698 loops=1)
  Workers Planned: 1
  Workers Launched: 1
  ->  Parallel Seq Scan on reviews_author  (cost=0.00..3453.06 rows=152 width=16) (actual time=0.292..78.502 rows=349 loops=2)
        Filter: ((last_name)::text % 'bar'::text)
        Rows Removed by Filter: 129127
Planning Time: 0.035 ms
Execution Time: 80.369 ms

In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort  (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.191..168.890 rows=14235 loops=1)
  Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
  Sort Method: quicksort  Memory: 1485kB
  ->  Seq Scan on reviews_author  (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.029..165.743 rows=14235 loops=1)
        Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
        Rows Removed by Filter: 244717
Planning Time: 0.054 ms
Execution Time: 169.340 ms

Still very similar; it seems to be avoiding the GIN index.

CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");

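This did not create a trigram index. It created a GIN index on the whole string, using the operators from btree_gin (which you don't seem to be using for any good purpose). To make a trigram index, it would need to look like this: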

CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name" gin_trgm_ops);

But I don't know how to get Django to do that; I'm not a Django user.
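For readers who do use Django: `GinIndex` accepts an `opclasses` argument (available since Django 2.2, as far as I know), which should emit a statement of the shape shown above. A minimal sketch, assuming the pg_trgm extension is already installed; the index name is made up, and Django requires an explicit `name` whenever `opclasses` is given:

```python
from django.contrib.postgres.indexes import GinIndex
from django.db import models


class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

    class Meta:
        indexes = [
            GinIndex(
                name="author_last_name_trgm",  # hypothetical index name
                fields=["last_name"],
                opclasses=["gin_trgm_ops"],    # one operator class per field
            ),
        ]
```

After `makemigrations`, running `sqlmigrate` on the new migration is a quick way to confirm that the generated SQL contains `gin_trgm_ops`. Two caveats likely still apply. The `icontains` lookup compiles to `UPPER(...) LIKE UPPER(...)`, as the generated SQL in the question shows, so an index on plain `last_name` may not serve it; `last_name__contains` or an index on the upper-cased expression would probably be needed. Filtering on the `similarity()` function, as in the annotated query, is also not something the index can answer directly, whereas the `%` operator used by `trigram_similar` is.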