Completion Suggester in ElasticSearch on an Existing Field
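In my Elasticsearch index I have indexed a bunch of jobs. For simplicity, let's just say they are a bunch of job titles. When people type a job title into my search engine, I want to "auto complete" with possible matches.
I looked into the Completion Suggester here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-suggesters-completion.html
However, all the examples I found involve creating a new field on your index and populating it manually at indexing/rivering time.
Is there any way to get a Completion Suggester on an existing field? Even if it means re-indexing the data, that's fine. For example, when I want to keep the original not_analyzed text, I can do something like this in the mapping: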
"JobTitle": {
  "type": "string",
  "fields": {
    "Original": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}
Is this possible with suggesters?
If not, is it possible to do non-whitespace tokenizing / N-Gram search on these fields? It would be slower, but I assume it would work.
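OK, here is the simple approach, which may (or may not) scale, using prefix queries.
I'll create an index using the "fields" technique you mentioned, along with some handy job description data I found: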
DELETE /test_index
PUT /test_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
PUT /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"experienced bra fitter", "desc":"I bet they had trouble finding candidates for this one."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"PlayStation Brand Ambassador", "desc":"please report to your residence in the United States of Nintendo."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Eyebrow Threading", "desc":"I REALLY hope this has something to do with dolls."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Administraive/ Secretary", "desc":"ok, ok, we get it. It’s clear where you need help."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Finish Carpenter", "desc":"for when the Start Carpenter gets tired."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Helpdesk Technician @ Pentagon", "desc":"“Uh, hello? I’m having a problem with this missile…”"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Nail Tech", "desc":"so nails can be pretty complicated…"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Remedy Engineer", "desc":"aren’t those called “doctors”?"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Saltlick Cashier", "desc":"new trend in the equestrian industry. Ok, enough horsing around."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Molecular Biologist II", "desc":"when Molecular Biologist I gets promoted."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Breakfast Sandwich Maker", "desc":"we also got one of these recently."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Hotel Housekeepers", "desc":"why can’t they just say ‘hotelkeepers’?"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Preschool Teacher #4065", "desc":"either that’s a really big school or they’ve got robot teachers."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"glacéau drop team", "desc":"for a new sport at the Winter Olympics: ice-water spilling."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"PLUMMER/ELECTRICIAN", "desc":"get a dictionary/thesaurus first."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"DoodyCalls Technician", "desc":"they really shouldn’t put down janitors like that."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Golf Staff", "desc":"and here I thought they were called clubs."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Pressure Washers", "desc":"what’s next, heat cleaners?"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Sandwich Artist", "desc":"another “Jesus in my food” wannabe."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Self Storage Manager", "desc":"this is for self storage?"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Qualified Infant Caregiver", "desc":"too bad for all the unqualified caregivers on the list."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Ground Support", "desc":"but there’s just more dirt under there."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Gymboree Teacher", "desc":"the hardest part is not burning your hands sliding down the pole."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"COMMERCIAL space hunter", "desc":"so they did find animals further out in the cosmos? Who knew."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"JOB COACH", "desc":"if they’re unemployed when they get to you, what does that say about them?"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"KIDS KAMP INSTRUCTOR!", "desc":"no spelling ability required."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"POOLS SUPERVISOR", "desc":"“yeah, they’re still wet…”"}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"HOUSE MANAGER/TEEN SUPERVISOR", "desc":"see the dictionary under P, for Parent."}
{"index":{"_index":"test_index","_type":"doc"}}
{"title":"Licensed Seamless Gutter Contractor", "desc":"just sounds bad."}
Then I can easily run a prefix query:
POST /test_index/_search
{
  "query": {
    "prefix": {
      "title": {
        "value": "san"
      }
    }
  }
}
...
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "mcRfqtwzTyWE7ZNsKFvwEg",
        "_score": 1,
        "_source": {
          "title": "Breakfast Sandwich Maker",
          "desc": "we also got one of these recently."
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "fIYV0WOWRe6gfpYy_u2jlg",
        "_score": 1,
        "_source": {
          "title": "Sandwich Artist",
          "desc": "another “Jesus in my food” wannabe."
        }
      }
    ]
  }
}
Or, if I want to be stricter about what matches, I can use the not_analyzed field:
POST /test_index/_search
{
  "query": {
    "prefix": {
      "title.raw": {
        "value": "San"
      }
    }
  }
}
...
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "fIYV0WOWRe6gfpYy_u2jlg",
        "_score": 1,
        "_source": {
          "title": "Sandwich Artist",
          "desc": "another “Jesus in my food” wannabe."
        }
      }
    ]
  }
}
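To make the difference between the two prefix queries concrete, here is a small Python sketch that imitates the matching outside of Elasticsearch. It is only an approximation under stated assumptions: `analyzed_terms` is a hypothetical stand-in for the default analyzer (lowercase, split on non-alphanumerics), and the not_analyzed field is modeled as the whole title kept as one case-preserved term. The key point it illustrates is that the prefix itself is not analyzed, so it is compared exactly as typed.

```python
import re

# A few titles taken from the sample data above.
TITLES = [
    "Breakfast Sandwich Maker",
    "Sandwich Artist",
    "Saltlick Cashier",
    "Ground Support",
]

def analyzed_terms(text):
    # Assumption: rough stand-in for the default analyzer
    # (lowercase, then split on anything that is not a letter or digit).
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def prefix_on_analyzed(prefix, titles):
    # The prefix is used as typed; it is compared against each lowercased term.
    return [t for t in titles
            if any(term.startswith(prefix) for term in analyzed_terms(t))]

def prefix_on_raw(prefix, titles):
    # A not_analyzed field is modeled as a single term: the untouched title.
    return [t for t in titles if t.startswith(prefix)]

print(prefix_on_analyzed("san", TITLES))  # matches any word starting with "san"
print(prefix_on_raw("San", TITLES))       # matches only titles starting with "San"
print(prefix_on_analyzed("San", TITLES))  # capitalized prefix vs. lowercased terms
```

In this model a capitalized prefix finds nothing on the analyzed field, so for autocomplete you would probably lowercase the user's input before querying `title`, and leave it untouched when querying `title.raw`.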
That's the simple way. Ngrams are a bit more involved, but not difficult. I'll add that in another answer in a bit.
Here is the code I used:
http://sense.qbox.io/gist/4e066d051d7dab5fe819264b0f4b26d958d115a9
EDIT: Ngram version
Borrowing the analyzers from this blog post (shameless plug), I can set up the index like this:
DELETE /test_index
PUT /test_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "string",
          "index_analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
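Note that I'm using different analyzers for indexing and searching. That's important: if the search query were also broken up into ngrams, we would probably get far more hits than we want.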
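The effect of the two analyzers can be sketched in plain Python. This is an illustration under assumptions, not what Elasticsearch runs internally: `ngrams`, `index_terms` and `matches` are hypothetical helpers, asciifolding is left out, and a match is modeled as "any query term appears among the indexed terms", which is how a `match` query with the default OR operator behaves.

```python
def ngrams(token, min_gram=2, max_gram=20):
    # All substrings of the lowercased token with length min_gram..max_gram.
    token = token.lower()
    return {token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)}

def index_terms(title):
    # Index time: split on whitespace, then expand every word into ngrams.
    terms = set()
    for word in title.split():
        terms |= ngrams(word)
    return terms

def matches(query_terms, title):
    # Modeled as OR semantics: one shared term is enough for a hit.
    return bool(set(query_terms) & index_terms(title))

TITLES = ["Ground Support", "POOLS SUPERVISOR", "Pressure Washers", "Golf Staff"]

# Search analyzer = whitespace + lowercase: the query "sup" stays a single term.
hits_whole = [t for t in TITLES if matches({"sup"}, t)]

# Search analyzer = ngram: the query "sup" becomes "su", "up" and "sup".
hits_split = [t for t in TITLES if matches(ngrams("sup"), t)]

print(hits_whole)
print(hits_split)
```

With the query split into ngrams, "Pressure Washers" becomes a hit merely because "pressure" contains "su", which is the kind of noise the separate search analyzer avoids.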
After populating the index with the same data set used above, I can use a simple match query to get the results I'd expect:
POST /test_index/_search
{
  "query": {
    "match": {
      "title": "sup"
    }
  }
}
...
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.8631258,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "4pcAOmPNSYupjz7lSes8jw",
        "_score": 1.8631258,
        "_source": {
          "title": "Ground Support",
          "desc": "but there’s just more dirt under there."
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "DVFOC6DsTa6eH_a-RtbUUw",
        "_score": 1.8631258,
        "_source": {
          "title": "POOLS SUPERVISOR",
          "desc": "“yeah, they’re still wet…”"
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "klleY_bnQ4uFmCPF94sLOw",
        "_score": 1.4905007,
        "_source": {
          "title": "HOUSE MANAGER/TEEN SUPERVISOR",
          "desc": "see the dictionary under P, for Parent."
        }
      }
    ]
  }
}
Here is the code:
http://sense.qbox.io/gist/b0e77bb7f05a4527de5ab4345749c793f923794c