django-haystack autocomplete returns results that are too broad
I created an index with a title_auto field:
from haystack import indexes

class GameIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, model_attr='title')
    title = indexes.CharField(model_attr='title')
    title_auto = indexes.NgramField(model_attr='title')
The Elasticsearch settings look like this:
ELASTICSEARCH_INDEX_SETTINGS = {
    'settings': {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_ngram"],
                    "token_chars": ["letter", "digit"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_edgengram"]
                }
            },
            "tokenizer": {
                "haystack_ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": 1,
                    "max_gram": 15,
                },
                "haystack_edgengram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": 1,
                    "max_gram": 15,
                    "side": "front"
                }
            },
            "filter": {
                "haystack_ngram": {
                    "type": "nGram",
                    "min_gram": 1,
                    "max_gram": 15
                },
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 1,
                    "max_gram": 15
                }
            }
        }
    }
}
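I tried running an autocomplete search. It works, but it returns too many irrelevant results:
qs = SearchQuerySet().models(Game).autocomplete(title_auto=search_phrase)
or
qs = SearchQuerySet().models(Game).filter(title_auto=search_phrase)
Both produce the same output.
If search_phrase is "monopoly", the first results contain "Monopoly" in their titles, but although only 2 items are relevant, the query returns 51. The rest have nothing to do with "Monopoly".
So my question is: how can I change the relevance of the results?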
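It's hard to say for sure since I haven't seen your full mapping, but I suspect the problem is that the analyzer (one of them) is being used for both indexing and searching. When you index a document, lots of ngram terms get created and indexed. If your search text is analyzed the same way, lots of search terms get generated as well. Since your smallest ngram is a single letter, pretty much any query will match a lot of documents.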
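To get a feel for how many terms that is, here is a minimal Python sketch of n-gram generation. It only approximates what an nGram filter with min_gram 1 and max_gram 15 does to a single token; the function name ngrams is made up for illustration and is not part of Elasticsearch or haystack.

```python
def ngrams(token, min_gram=1, max_gram=15):
    """Return every substring of `token` with a length between min_gram and max_gram.

    Illustrative approximation of an n-gram token filter, not Elasticsearch code.
    """
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


poly_grams = ngrams("poly")
monopoly_grams = ngrams("monopoly")

print(poly_grams)           # the terms a four-letter query expands into
print(len(monopoly_grams))  # number of grams produced for one indexed title
```

With a one-letter minimum, a four-letter query already expands into ten terms, four of which are single letters.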
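We wrote a blog post about using ngrams for autocomplete that you might find helpful: http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams. But I'll give you a simpler example to illustrate what I mean. I'm not very familiar with haystack, so I probably can't help you there, but I can explain the issue with ngrams in Elasticsearch.
First, I'll set up an index that uses an ngram analyzer for both indexing and searching: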
PUT /test_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 15,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "nGram_analyzer"
        }
      }
    }
  }
}
and add a few documents:
PUT /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"title":"monopoly"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"title":"oligopoly"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"title":"plutocracy"}
{"index":{"_index":"test_index","_type":"doc","_id":4}}
{"title":"theocracy"}
{"index":{"_index":"test_index","_type":"doc","_id":5}}
{"title":"democracy"}
and run a simple match search for "poly":
POST /test_index/_search
{
  "query": {
    "match": {
      "title": "poly"
    }
  }
}
It returns all five documents:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 4.729521,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "2",
        "_score": 4.729521,
        "_source": {
          "title": "oligopoly"
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 4.3608603,
        "_source": {
          "title": "monopoly"
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "3",
        "_score": 1.0197333,
        "_source": {
          "title": "plutocracy"
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "4",
        "_score": 0.31496215,
        "_source": {
          "title": "theocracy"
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "5",
        "_score": 0.31496215,
        "_source": {
          "title": "democracy"
        }
      }
    ]
  }
}
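This is because the search term "poly" gets tokenized into the terms "p", "o", "l", and "y" (among other, longer ngrams), and since the "title" field of every document was also tokenized into single-letter terms, the query matches every document.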
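The effect can be reproduced without Elasticsearch. The sketch below is only a rough simulation: ngrams is an illustrative helper, a title counts as a match when it shares at least one term with the query, and the shared-term count is a crude stand-in for relevance, not Elasticsearch's actual scoring.

```python
def ngrams(token, min_gram=1, max_gram=15):
    """All substrings of `token` with a length in [min_gram, max_gram] (illustrative)."""
    return {
        token[start:start + size]
        for size in range(min_gram, min(max_gram, len(token)) + 1)
        for start in range(len(token) - size + 1)
    }


titles = ["monopoly", "oligopoly", "plutocracy", "theocracy", "democracy"]

# Index time: every title is stored as its set of n-grams.
index = {title: ngrams(title) for title in titles}

# Search time: the query goes through the same n-gram analysis.
query_terms = ngrams("poly")

# Count how many query terms each title shares; one shared term is enough to match.
shared = {title: len(query_terms & grams) for title, grams in index.items()}
matches = [title for title in titles if shared[title] > 0]

print(matches)
print(shared)
```

The unrelated titles match only through single letters such as "o" and "y", which is why they come back with low scores instead of being excluded.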
If we rebuild the index with this mapping instead (same analyzer and documents):
"mappings": {
  "doc": {
    "properties": {
      "title": {
        "type": "string",
        "index_analyzer": "nGram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
the query returns what we expect:
POST /test_index/_search
{
  "query": {
    "match": {
      "title": "poly"
    }
  }
}
...
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.5108256,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 1.5108256,
        "_source": {
          "title": "monopoly"
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "2",
        "_score": 1.5108256,
        "_source": {
          "title": "oligopoly"
        }
      }
    ]
  }
}
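The same kind of rough simulation shows why a separate search analyzer helps. Here the index side still uses n-grams, while the query is only lowercased and split on whitespace, which is a simplified stand-in for the standard analyzer. The helper name ngrams is again just for illustration.

```python
def ngrams(token, min_gram=1, max_gram=15):
    """All substrings of `token` with a length in [min_gram, max_gram] (illustrative)."""
    return {
        token[start:start + size]
        for size in range(min_gram, min(max_gram, len(token)) + 1)
        for start in range(len(token) - size + 1)
    }


titles = ["monopoly", "oligopoly", "plutocracy", "theocracy", "democracy"]

# Index time: n-grams, exactly as before.
index = {title: ngrams(title) for title in titles}

# Search time: whole lowercased words only, no n-gram expansion.
query_terms = set("poly".lower().split())

# A title matches only if one of its indexed n-grams equals a whole query word.
matches = [title for title in titles if query_terms & index[title]]

print(matches)
```

Only titles that actually contain "poly" as a substring survive, because the query is no longer broken into single letters.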
Edge ngrams work similarly, except that only terms starting at the beginning of a word are used.
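As a small illustration of the difference, here is a sketch of front-anchored edge n-grams in Python. The function edge_ngrams is a hypothetical helper that approximates an edgeNGram filter with min_gram 1 and max_gram 15; it is not Elasticsearch code.

```python
def edge_ngrams(token, min_gram=1, max_gram=15):
    """Return the prefixes of `token` with a length in [min_gram, max_gram] (illustrative)."""
    return [token[:size] for size in range(min_gram, min(max_gram, len(token)) + 1)]


monopoly_terms = edge_ngrams("monopoly")

print(monopoly_terms)
print("mono" in monopoly_terms)  # a prefix of the word is indexed
print("poly" in monopoly_terms)  # a substring from the middle of the word is not
```

With edge n-grams on the index side and a plain search analyzer, a query matches only words that start with the typed text, which is usually what autocomplete needs.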
Here is the code I used for this example:
http://sense.qbox.io/gist/b24cbc531b483650c085a42963a49d6a23fa5579
Unfortunately, there currently seems to be no way (other than implementing a custom backend) to configure the search analyzer and the index analyzer separately through Django-Haystack.
If Django-Haystack autocomplete returns results that are too broad, you can use the score provided with each search result to trim the output.
if search_query != "":
    # Use an autocomplete query or a filter,
    # with results_filtered being a SearchQuerySet()
    results_filtered = results_filtered.filter(text=search_query)
    # Remove objects with a low score
    for result in results_filtered:
        if result.score < SEARCH_SCORE_THRESHOLD:
            results_filtered = results_filtered.exclude(id=result.id)
It works well for me without having to define my own backend and schema building.
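The same idea can be sketched as a single pass that produces a plain list, which avoids chaining one exclude() per low-scoring result. This is only a sketch under assumptions: Hit is a hypothetical stand-in for a search result that exposes a score attribute, the sample scores are borrowed from the Elasticsearch output earlier in this thread purely for illustration, and the threshold value is arbitrary.

```python
from dataclasses import dataclass

SEARCH_SCORE_THRESHOLD = 2.0  # assumed cut-off; tune it against your own data


@dataclass
class Hit:
    """Hypothetical stand-in for a search result that exposes a `score`."""
    title: str
    score: float


def drop_low_scores(hits, threshold=SEARCH_SCORE_THRESHOLD):
    """Keep only the hits whose score reaches the threshold, preserving order."""
    return [hit for hit in hits if hit.score >= threshold]


hits = [
    Hit("oligopoly", 4.729521),
    Hit("monopoly", 4.3608603),
    Hit("plutocracy", 1.0197333),
    Hit("theocracy", 0.31496215),
]
kept = drop_low_scores(hits)

print([hit.title for hit in kept])
```

The trade-off is that the result is a list rather than a SearchQuerySet, so any further filtering has to happen before this step. Scores are not comparable across different queries, so a fixed threshold may need adjusting.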