Elasticsearch: all prefixes must appear in a document
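I am using match_phrase_prefix, but I want every word in the query to be treated as a prefix, and all of those prefixes must appear in the document, regardless of order. Extra tokens in the document are fine.
For example, searching for Nik shoe Mic Jord should match:
- Nike shoes are worn by Michael Jordan
- Michael Jordan wears shoes from Nike
However, the following should not match:
- Mike Jordan (because only the prefix Jord is present)
- Nike is owned by Michael Jordan (because the prefix shoe is missing)
So the question is: how do I treat all words as prefixes, and how do I make sure that all prefixes are present in the document?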
Returns documents that contain the words of a provided text, in the
same order as provided. The last term of the provided text is treated
as a prefix, matching any words that begin with that term.
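So with match_phrase_prefix, "Nik shoe Mic Jord" treats only Jord as a prefix, and the tokens must also appear in the same order.
To treat every token as a prefix, use the edge_ngram tokenizer: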
The edge_ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word.
Mapping (min_gram and max_gram set the shortest and longest prefix emitted for each word):
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Documents:
[
  {
    "_index" : "index80",
    "_type" : "_doc",
    "_id" : "6RTKTXEBLqTvxU9z8bl3",
    "_score" : 1.0,
    "_source" : {
      "title" : "Nike shoes are worn by Michael Jordan"
    }
  },
  {
    "_index" : "index80",
    "_type" : "_doc",
    "_id" : "6hTLTXEBLqTvxU9zIrks",
    "_score" : 1.0,
    "_source" : {
      "title" : "Michael Jordan wears shoes from Nike"
    }
  },
  {
    "_index" : "index80",
    "_type" : "_doc",
    "_id" : "6xTLTXEBLqTvxU9zQbm4",
    "_score" : 1.0,
    "_source" : {
      "title" : "Mike Jordan"
    }
  },
  {
    "_index" : "index80",
    "_type" : "_doc",
    "_id" : "7BTLTXEBLqTvxU9zbLkT",
    "_score" : 1.0,
    "_source" : {
      "title" : "Nike is owned by Michael Jordan"
    }
  }
]
Query (the "and" operator requires every token of the query to match):
{
  "query": {
    "match": {
      "title": {
        "query": "Nik shoe Mic Jord",
        "operator": "and"
      }
    }
  }
}
Result:
[
  {
    "_index" : "index80",
    "_type" : "_doc",
    "_id" : "6RTKTXEBLqTvxU9z8bl3",
    "_score" : 3.2434955,
    "_source" : {
      "title" : "Nike shoes are worn by Michael Jordan"
    }
  },
  {
    "_index" : "index80",
    "_type" : "_doc",
    "_id" : "6hTLTXEBLqTvxU9zIrks",
    "_score" : 3.1820722,
    "_source" : {
      "title" : "Michael Jordan wears shoes from Nike"
    }
  }
]
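To see why only those two documents come back, here is a rough Python sketch of the matching logic. It is not Elasticsearch code: `edge_ngrams` and `matches_all` are hypothetical helpers that only approximate the tokenizer from the mapping above (split on anything that is not a letter or digit, emit prefixes of 2 to 10 characters), and the sketch assumes the same analyzer is applied to the query string, which is the default when no search analyzer is set.

```python
import re


def edge_ngrams(text, min_gram=2, max_gram=10):
    """Approximate the edge_ngram tokenizer: split on characters that are
    not letters or digits, then collect each word's leading prefixes."""
    tokens = set()
    for word in re.findall(r"[^\W_]+", text):
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.add(word[:size])
    return tokens


def matches_all(query, title):
    # operator "and": every token produced from the query must also be
    # among the tokens produced from the document.
    return edge_ngrams(query) <= edge_ngrams(title)


titles = [
    "Nike shoes are worn by Michael Jordan",
    "Michael Jordan wears shoes from Nike",
    "Mike Jordan",
    "Nike is owned by Michael Jordan",
]

hits = [t for t in titles if matches_all("Nik shoe Mic Jord", t)]
print(hits)
```

"Mike Jordan" drops out because it produces no "Ni" or "Nik" token, and "Nike is owned by Michael Jordan" drops out because it produces no "sh", "sho" or "shoe" token.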
For the word "Michael", with min_gram 2 and max_gram 10, the following tokens are generated:
{
  "token" : "Mi",
  "start_offset" : 23,
  "end_offset" : 25,
  "type" : "word",
  "position" : 13
},
{
  "token" : "Mic",
  "start_offset" : 23,
  "end_offset" : 26,
  "type" : "word",
  "position" : 14
},
{
  "token" : "Mich",
  "start_offset" : 23,
  "end_offset" : 27,
  "type" : "word",
  "position" : 15
},
{
  "token" : "Micha",
  "start_offset" : 23,
  "end_offset" : 28,
  "type" : "word",
  "position" : 16
},
{
  "token" : "Michae",
  "start_offset" : 23,
  "end_offset" : 29,
  "type" : "word",
  "position" : 17
},
{
  "token" : "Michael",
  "start_offset" : 23,
  "end_offset" : 30,
  "type" : "word",
  "position" : 18
}
The tokens for this word are 2 to 7 characters long, so choosing min_gram and max_gram matters. A wide range bloats your index; a narrow range causes documents not to match.
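The trade-off can be made concrete with a small Python sketch. `word_prefixes` is a hypothetical helper, not part of any Elasticsearch client; it only mimics which prefixes an edge n-gram tokenizer would emit for a single word, and the token counts are for one sample title split on whitespace.

```python
def word_prefixes(word, min_gram, max_gram):
    """Prefixes an edge n-gram tokenizer would emit for a single word."""
    longest = min(max_gram, len(word))
    return [word[:size] for size in range(min_gram, longest + 1)]


# With max_gram 10 the whole word is covered; with max_gram 5 the
# prefixes stop at "Micha".
print(word_prefixes("Michael", 2, 10))
print(word_prefixes("Michael", 2, 5))

# Number of tokens indexed for one title as max_gram grows.
title = "Nike shoes are worn by Michael Jordan"
counts = {
    max_gram: sum(len(word_prefixes(w, 2, max_gram)) for w in title.split())
    for max_gram in (3, 5, 10)
}
print(counts)  # {3: 13, 5: 21, 10: 24}
```

Even on a seven-word title the token count nearly doubles between max_gram 3 and max_gram 10, which is the index growth the answer warns about.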
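Although the solution given by @jaspreet works, it requires creating edge n-gram tokens. Apart from having to find the right balance between min_gram and max_gram, it can create a huge index just to serve small partial queries, which can cause performance issues, and changing these settings again and again requires reindexing.
Another solution is to use prefix queries, which were created for exactly this kind of use case, as in the example below. There is one caveat: the search is case-sensitive, so you should lowercase all search terms to make it behave case-insensitively, which is best practice anyway.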
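The case-sensitivity caveat is easier to see in a small simulation. This is only a sketch under two assumptions: the field uses the default standard analyzer, which lowercases tokens at index time, and the prefix value is compared as typed rather than being analyzed. `standard_tokens` and `prefix_hit` are hypothetical helpers, not Elasticsearch APIs.

```python
import re


def standard_tokens(text):
    # Rough stand-in for the standard analyzer: split into words, lowercase.
    return [word.lower() for word in re.findall(r"\w+", text)]


def prefix_hit(value, title):
    # Term-level comparison: the value is used exactly as typed.
    return any(token.startswith(value) for token in standard_tokens(title))


title = "Nike shoes are worn by Michael Jordan"
print(prefix_hit("Nik", title))  # False: the indexed token is "nike"
print(prefix_hit("nik", title))  # True
```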
Simple index definition
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      }
    }
  }
}
Index all 4 sample documents
{
  "title" : "Nike shoes are worn by Michael Jordan"
}
{
  "title" : "Michael Jordan wears shoes from Nike"
}
{
  "title" : "Mike Jordan"
}
{
  "title" : "Nike is owned by Michael Jordan"
}
Prefix search query
{
  "query": {
    "bool": {
      "must": [
        {
          "prefix": {
            "title": {
              "value": "nik"
            }
          }
        },
        {
          "prefix": {
            "title": {
              "value": "shoe"
            }
          }
        },
        {
          "prefix": {
            "title": {
              "value": "mic"
            }
          }
        },
        {
          "prefix": {
            "title": {
              "value": "jord"
            }
          }
        }
      ]
    }
  }
}
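Writing one prefix clause per word by hand gets tedious, so the body can be generated from the user's input. Below is a minimal Python sketch; `build_prefix_query` is a hypothetical helper name, the field name `title` comes from the mapping above, and splitting on whitespace is an assumption about how the search text should be broken into words. It only builds the request body and does not talk to a cluster.

```python
import json


def build_prefix_query(search_text, field="title"):
    """Build a bool query with one lowercased `prefix` clause per word."""
    clauses = [
        {"prefix": {field: {"value": word.lower()}}}
        for word in search_text.split()
    ]
    # Every clause sits under "must", so all prefixes have to be present.
    return {"query": {"bool": {"must": clauses}}}


body = build_prefix_query("Nik shoe Mic Jord")
print(json.dumps(body, indent=2))
```

The printed body has the same structure as the hand-written query, with the terms already lowercased.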
And your expected results
"hits": [
  {
    "_index": "prefix",
    "_type": "_doc",
    "_id": "1",
    "_score": 4.0,
    "_source": {
      "title": "Nike shoes are worn by Michael Jordan"
    }
  },
  {
    "_index": "prefix",
    "_type": "_doc",
    "_id": "2",
    "_score": 4.0,
    "_source": {
      "title": "Michael Jordan wears shoes from Nike"
    }
  }
]