使用 trigram 在 Django 中搜索文本
Text searching in Django with trigram
我想在我的应用程序中加快搜索结果的速度,但是,无论我使用什么方法,我总是得到相同的结果。由于它是 Django 应用程序,我将同时提供 ORM 命令和生成的 SQL 代码(使用 PostgreSQL)。
首先,我在数据库上启用了GIN索引和三元组运算:
其次,我创建了包含 2 个 varchar 列的 table:first_name 和 last_name(加上一个 id 字段作为主键)。
from django.db import models
class Author(models.Model):
first_name = models.CharField(max_length=100)
last_name = models.CharField(max_length=100)
我还用952条示例记录填充了数据库,这样我就不会出现Postgres因为数据集太小而避免使用索引的情况。
接下来,我 运行 对非索引数据进行查询。
简单的 LIKE 查询:
In [50]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)
In [51]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=T
...: rue))
Seq Scan on reviews_author (cost=0.00..24.28 rows=38 width=16) (actual time=0.011..0.242 rows=56 loops=1)
Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
Rows Removed by Filter: 896
Planning Time: 0.042 ms
Execution Time: 0.249 ms
八卦相似:
In [55]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari
In [56]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(ana
...: lyze=True))
Seq Scan on reviews_author (cost=0.00..21.90 rows=1 width=16) (actual time=0.582..0.582 rows=0 loops=1)
Filter: ((last_name)::text % 'ari'::text)
Rows Removed by Filter: 952
Planning Time: 0.033 ms
Execution Time: 0.591 ms
还有一个带有排序结果的更花哨的查询:
In [58]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
...: ')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC
In [59]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
...: ')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=38.24..39.03 rows=317 width=20) (actual time=0.680..0.683 rows=84 loops=1)
Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
Sort Method: quicksort Memory: 31kB
-> Seq Scan on reviews_author (cost=0.00..25.07 rows=317 width=20) (actual time=0.021..0.657 rows=84 loops=1)
Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
Rows Removed by Filter: 868
Planning Time: 0.062 ms
Execution Time: 0.693 ms
下一步是创建索引:
class Author(models.Model):
first_name = models.CharField(max_length=100)
last_name = models.CharField(max_length=100)
class Meta:
indexes = [GinIndex(fields=['last_name'])]
这导致了以下 SQL 迁移:
./manage.py sqlmigrate reviews 0004
BEGIN;
--
-- Alter field score on review
--
--
-- Create index reviews_aut_last_na_a89a84_gin on field(s) last_name of model author
--
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");
COMMIT;
现在我运行同样的命令。
喜欢:
In [60]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)
In [61]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=T
...: rue))
Seq Scan on reviews_author (cost=0.00..24.28 rows=38 width=16) (actual time=0.009..0.237 rows=56 loops=1)
Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
Rows Removed by Filter: 896
Planning Time: 0.089 ms
Execution Time: 0.244 ms
八卦相似:
In [62]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari
In [63]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(ana
...: lyze=True))
Seq Scan on reviews_author (cost=0.00..21.90 rows=1 width=16) (actual time=0.740..0.740 rows=0 loops=1)
Filter: ((last_name)::text % 'ari'::text)
Rows Removed by Filter: 952
Planning Time: 0.056 ms
Execution Time: 0.750 ms
更复杂的查询:
In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
...: ')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC
In [65]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
...: ')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=38.24..39.03 rows=317 width=20) (actual time=0.659..0.662 rows=84 loops=1)
Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
Sort Method: quicksort Memory: 31kB
-> Seq Scan on reviews_author (cost=0.00..25.07 rows=317 width=20) (actual time=0.024..0.643 rows=84 loops=1)
Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
Rows Removed by Filter: 868
Planning Time: 0.052 ms
Execution Time: 0.674 ms
执行时间的变化似乎微不足道。在最后一个查询的情况下,扫描需要 0.643 个单位,而在前一个情况下为 0.657。时间也相差 0.02 毫秒(第二个查询 运行 甚至更慢一点)。是否有一些我遗漏的选项应该启用以帮助提高性能?是不是数据集太简单了?
我使用的文档:
编辑
我添加了几条记录(现在有将近 259 000 条记录)并再次进行 运行 测试。首先没有索引:
In [59]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.018..58.630 rows=846 loops=1)
Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
Rows Removed by Filter: 258106
Planning Time: 0.046 ms
Execution Time: 58.662 ms
In [60]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.555..80.710 rows=698 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Parallel Seq Scan on reviews_author (cost=0.00..3453.06 rows=152 width=16) (actual time=0.503..78.743 rows=349 loops=2)
Filter: ((last_name)::text % 'bar'::text)
Rows Removed by Filter: 129127
Planning Time: 0.039 ms
Execution Time: 80.740 ms
In [61]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.214..168.876 rows=14235 loops=1)
Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
Sort Method: quicksort Memory: 1485kB
-> Seq Scan on reviews_author (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.022..165.806 rows=14235 loops=1)
Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
Rows Removed by Filter: 244717
Planning Time: 0.052 ms
Execution Time: 169.319 ms
还有它:
In [62]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.015..59.366 rows=846 loops=1)
Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
Rows Removed by Filter: 258106
Planning Time: 0.072 ms
Execution Time: 59.395 ms
In [63]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.545..80.337 rows=698 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Parallel Seq Scan on reviews_author (cost=0.00..3453.06 rows=152 width=16) (actual time=0.292..78.502 rows=349 loops=2)
Filter: ((last_name)::text % 'bar'::text)
Rows Removed by Filter: 129127
Planning Time: 0.035 ms
Execution Time: 80.369 ms
In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.191..168.890 rows=14235 loops=1)
Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
Sort Method: quicksort Memory: 1485kB
-> Seq Scan on reviews_author (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.029..165.743 rows=14235 loops=1)
Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
Rows Removed by Filter: 244717
Planning Time: 0.054 ms
Execution Time: 169.340 ms
仍然非常相似,似乎在避免使用杜松子酒索引。
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");
这没有创建三元组索引。它使用 btree_gin 中的运算符在整个字符串上创建了一个 GIN 索引(您似乎没有将其用于任何好的目的)。要制作一个三元组索引,它需要看起来像这样:
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name" gin_trgm_ops);
但我不知道如何让 django 这样做,我不是 Django 用户。
我想在我的应用程序中加快搜索结果的速度,但是,无论我使用什么方法,我总是得到相同的结果。由于它是 Django 应用程序,我将同时提供 ORM 命令和生成的 SQL 代码(使用 PostgreSQL)。
首先,我在数据库上启用了GIN索引和三元组运算:
其次,我创建了包含 2 个 varchar 列的 table:first_name 和 last_name(加上一个 id 字段作为主键)。
from django.db import models
class Author(models.Model):
first_name = models.CharField(max_length=100)
last_name = models.CharField(max_length=100)
我还用952条示例记录填充了数据库,这样我就不会出现Postgres因为数据集太小而避免使用索引的情况。
接下来,我 运行 对非索引数据进行查询。
简单的 LIKE 查询:
In [50]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)
In [51]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=T
...: rue))
Seq Scan on reviews_author (cost=0.00..24.28 rows=38 width=16) (actual time=0.011..0.242 rows=56 loops=1)
Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
Rows Removed by Filter: 896
Planning Time: 0.042 ms
Execution Time: 0.249 ms
八卦相似:
In [55]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari
In [56]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(ana
...: lyze=True))
Seq Scan on reviews_author (cost=0.00..21.90 rows=1 width=16) (actual time=0.582..0.582 rows=0 loops=1)
Filter: ((last_name)::text % 'ari'::text)
Rows Removed by Filter: 952
Planning Time: 0.033 ms
Execution Time: 0.591 ms
还有一个带有排序结果的更花哨的查询:
In [58]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
...: ')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC
In [59]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
...: ')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=38.24..39.03 rows=317 width=20) (actual time=0.680..0.683 rows=84 loops=1)
Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
Sort Method: quicksort Memory: 31kB
-> Seq Scan on reviews_author (cost=0.00..25.07 rows=317 width=20) (actual time=0.021..0.657 rows=84 loops=1)
Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
Rows Removed by Filter: 868
Planning Time: 0.062 ms
Execution Time: 0.693 ms
下一步是创建索引:
class Author(models.Model):
first_name = models.CharField(max_length=100)
last_name = models.CharField(max_length=100)
class Meta:
indexes = [GinIndex(fields=['last_name'])]
这导致了以下 SQL 迁移:
./manage.py sqlmigrate reviews 0004
BEGIN;
--
-- Alter field score on review
--
--
-- Create index reviews_aut_last_na_a89a84_gin on field(s) last_name of model author
--
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");
COMMIT;
现在我运行同样的命令。
喜欢:
In [60]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)
In [61]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=T
...: rue))
Seq Scan on reviews_author (cost=0.00..24.28 rows=38 width=16) (actual time=0.009..0.237 rows=56 loops=1)
Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
Rows Removed by Filter: 896
Planning Time: 0.089 ms
Execution Time: 0.244 ms
八卦相似:
In [62]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari
In [63]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(ana
...: lyze=True))
Seq Scan on reviews_author (cost=0.00..21.90 rows=1 width=16) (actual time=0.740..0.740 rows=0 loops=1)
Filter: ((last_name)::text % 'ari'::text)
Rows Removed by Filter: 952
Planning Time: 0.056 ms
Execution Time: 0.750 ms
更复杂的查询:
In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
...: ')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC
In [65]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
...: ')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=38.24..39.03 rows=317 width=20) (actual time=0.659..0.662 rows=84 loops=1)
Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
Sort Method: quicksort Memory: 31kB
-> Seq Scan on reviews_author (cost=0.00..25.07 rows=317 width=20) (actual time=0.024..0.643 rows=84 loops=1)
Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
Rows Removed by Filter: 868
Planning Time: 0.052 ms
Execution Time: 0.674 ms
执行时间的变化似乎微不足道。在最后一个查询的情况下,扫描需要 0.643 个单位,而在前一个情况下为 0.657。时间也相差 0.02 毫秒(第二个查询 运行 甚至更慢一点)。是否有一些我遗漏的选项应该启用以帮助提高性能?是不是数据集太简单了?
我使用的文档:
编辑 我添加了几条记录(现在有将近 259 000 条记录)并再次进行 运行 测试。首先没有索引:
In [59]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.018..58.630 rows=846 loops=1)
Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
Rows Removed by Filter: 258106
Planning Time: 0.046 ms
Execution Time: 58.662 ms
In [60]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.555..80.710 rows=698 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Parallel Seq Scan on reviews_author (cost=0.00..3453.06 rows=152 width=16) (actual time=0.503..78.743 rows=349 loops=2)
Filter: ((last_name)::text % 'bar'::text)
Rows Removed by Filter: 129127
Planning Time: 0.039 ms
Execution Time: 80.740 ms
In [61]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.214..168.876 rows=14235 loops=1)
Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
Sort Method: quicksort Memory: 1485kB
-> Seq Scan on reviews_author (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.022..165.806 rows=14235 loops=1)
Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
Rows Removed by Filter: 244717
Planning Time: 0.052 ms
Execution Time: 169.319 ms
还有它:
In [62]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.015..59.366 rows=846 loops=1)
Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
Rows Removed by Filter: 258106
Planning Time: 0.072 ms
Execution Time: 59.395 ms
In [63]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.545..80.337 rows=698 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Parallel Seq Scan on reviews_author (cost=0.00..3453.06 rows=152 width=16) (actual time=0.292..78.502 rows=349 loops=2)
Filter: ((last_name)::text % 'bar'::text)
Rows Removed by Filter: 129127
Planning Time: 0.035 ms
Execution Time: 80.369 ms
In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.191..168.890 rows=14235 loops=1)
Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
Sort Method: quicksort Memory: 1485kB
-> Seq Scan on reviews_author (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.029..165.743 rows=14235 loops=1)
Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
Rows Removed by Filter: 244717
Planning Time: 0.054 ms
Execution Time: 169.340 ms
仍然非常相似,似乎在避免使用杜松子酒索引。
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");
这没有创建三元组索引。它使用 btree_gin 中的运算符在整个字符串上创建了一个 GIN 索引(您似乎没有将其用于任何好的目的)。要制作一个三元组索引,它需要看起来像这样:
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name" gin_trgm_ops);
但我不知道如何让 django 这样做,我不是 Django 用户。