Django full text search using indexes with PostgreSQL
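After solving the problem I asked about in this question, I am trying to optimize the performance of full-text search (FTS) using indexes.
I issued this command on my database: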
CREATE INDEX my_table_idx ON my_table USING gin(to_tsvector('italian', very_important_field), to_tsvector('italian', also_important_field), to_tsvector('italian', not_so_important_field), to_tsvector('italian', not_important_field), to_tsvector('italian', tags));
Then I edited the model's Meta class as follows:
class MyEntry(models.Model):
    very_important_field = models.TextField(blank=True, null=True)
    also_important_field = models.TextField(blank=True, null=True)
    not_so_important_field = models.TextField(blank=True, null=True)
    not_important_field = models.TextField(blank=True, null=True)
    tags = models.TextField(blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'my_table'
        indexes = [
            GinIndex(
                fields=['very_important_field', 'also_important_field', 'not_so_important_field', 'not_important_field', 'tags'],
                name='my_table_idx'
            )
        ]
But nothing seems to have changed. The lookup takes exactly the same time as before.
This is the lookup script:
from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector

# other unrelated stuff here

vector = SearchVector("very_important_field", weight="A") + \
    SearchVector("tags", weight="A") + \
    SearchVector("also_important_field", weight="B") + \
    SearchVector("not_so_important_field", weight="C") + \
    SearchVector("not_important_field", weight="D")
query = SearchQuery(search_string, config="italian")
rank = SearchRank(vector, query, weights=[0.4, 0.6, 0.8, 1.0])  # D, C, B, A
full_text_search_qs = MyEntry.objects.annotate(rank=rank).filter(rank__gte=0.4).order_by("-rank")
What am I doing wrong?
Edit:
The lookup above is wrapped in a function that I time with a decorator. The function actually returns a list, like this:
@timeit
def search(search_string):
    # the above code here
    qs = list(full_text_search_qs)
    return qs
Could this be the problem?
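The timeit decorator itself is not shown in the question. A minimal sketch of what such a helper might look like follows, using only the standard library; the names timeit and materialize are hypothetical stand-ins. Django querysets are lazy, so calling list() inside the timed function is what makes the measurement include the database round trip.

```python
import functools
import time


def timeit(func):
    """Print how long each call to ``func`` takes (hypothetical helper)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"{func.__name__} took {elapsed:.4f} s")

    return wrapper


@timeit
def materialize(rows):
    # Stand-in for ``list(full_text_search_qs)``: consuming the lazy iterable
    # is what triggers the real work, so it has to happen inside the timed call.
    return list(rows)


result = materialize(n * n for n in range(5))
```

If the list() call were moved outside the decorated function, the timer would only measure building the lazy queryset, not running the query.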
I am not sure, but according to the PostgreSQL documentation (https://www.postgresql.org/docs/9.5/static/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX):
Because the two-argument version of to_tsvector was used in the index
above, only a query reference that uses the 2-argument version of
to_tsvector with the same configuration name will use that index. That
is, WHERE to_tsvector('english', body) @@ 'a & b' can use the index,
but WHERE to_tsvector(body) @@ 'a & b' cannot. This ensures that an
index will be used only with the same configuration used to create the
index entries.
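I don't know which configuration Django uses, but you could try removing the first argument (the configuration name) from the to_tsvector calls in your index.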
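You need to add a SearchVectorField to your MyEntry model, update it from your actual text fields, and then perform the search on that field. However, the update can only be performed after the record has been saved to the database.
Essentially: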
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVector, SearchVectorField
from django.db import models


class MyEntry(models.Model):
    # The fields that contain the raw data.
    very_important_field = models.TextField(blank=True, null=True)
    also_important_field = models.TextField(blank=True, null=True)
    not_so_important_field = models.TextField(blank=True, null=True)
    not_important_field = models.TextField(blank=True, null=True)
    tags = models.TextField(blank=True, null=True)

    # The field we are actually going to search.
    # Must be null=True because we cannot set it immediately during create().
    search_vector = SearchVectorField(editable=False, null=True)

    class Meta:
        # The search index pointing to our actual search field.
        indexes = [GinIndex(fields=["search_vector"])]
You can then create plain instances as usual, for example:
# Does not set MyEntry.search_vector yet.
my_entry = MyEntry.objects.create(
    very_important_field="something very important",  # Fake Italian text ;-)
    also_important_field="something different but equally important",
    not_so_important_field="this one matters less",
    not_important_field="we don't care about that one at all",
    tags="things, stuff, whatever",
)
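Now that the entry exists in the database, you can update the search_vector field using various options, for example weight to specify importance and config to use one of the default language configurations. You can also leave out entirely any fields you do not want to search: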
# Update search vector on existing database record.
my_entry.search_vector = (
    SearchVector("very_important_field", weight="A", config="italian")
    + SearchVector("also_important_field", weight="A", config="italian")
    + SearchVector("not_so_important_field", weight="C", config="italian")
    + SearchVector("tags", weight="B", config="italian")
)
my_entry.save()
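Manually updating the search_vector field every time one of the text fields changes is error-prone, so consider adding an SQL trigger through a Django migration to do it for you. For an example of how to do that, see the blog post Full-text Search with Django and PostgreSQL.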
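As a rough illustration of the trigger idea (not taken from the blog post), the sketch below builds the SQL for a row-level trigger that recomputes the vector on every insert or update. The helper build_search_vector_trigger_sql is hypothetical. The table name my_table comes from the question and the column name search_vector from the model above; both are assumptions about your schema, as are the weights, which mirror the manual update shown earlier.

```python
def build_search_vector_trigger_sql(table, vector_column, weighted_columns, config):
    """Return SQL for a trigger that keeps ``vector_column`` up to date.

    Hypothetical helper: identifiers are interpolated without escaping,
    so only pass trusted, hard-coded names.
    """
    parts = [
        f"setweight(to_tsvector('{config}', coalesce(NEW.{column}, '')), '{weight}')"
        for column, weight in weighted_columns
    ]
    expression = " ||\n        ".join(parts)
    function_name = f"{table}_{vector_column}_update"
    return f"""
CREATE FUNCTION {function_name}() RETURNS trigger AS $$
BEGIN
    NEW.{vector_column} :=
        {expression};
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER {function_name}_trigger
BEFORE INSERT OR UPDATE ON {table}
FOR EACH ROW EXECUTE PROCEDURE {function_name}();
"""


# Assumed names: table from the question, column and weights from the model above.
TRIGGER_SQL = build_search_vector_trigger_sql(
    table="my_table",
    vector_column="search_vector",
    weighted_columns=[
        ("very_important_field", "A"),
        ("also_important_field", "A"),
        ("not_so_important_field", "C"),
        ("tags", "B"),
    ],
    config="italian",
)
print(TRIGGER_SQL)
```

The resulting string could be passed to migrations.RunSQL in a migration, ideally together with a reverse_sql that drops the trigger and the function. Because the trigger fires before insert, the vector is filled in when the row is created, so the separate manual update step should no longer be needed.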
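To actually search MyEntry using the index, filter and rank by the search_vector field. The config of the SearchQuery should match the one used for the SearchVector above, so that the same stop words, stemming and so on apply.
For example: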
from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F

search_query = SearchQuery("important", search_type="websearch", config="italian")
search_rank = SearchRank(F("search_vector"), search_query)
my_entries_found = (
    MyEntry.objects.annotate(rank=search_rank)
    .filter(search_vector=search_query)  # Perform full text search on index.
    .order_by("-rank")  # Yield most relevant entries first.
)