Elasticsearch match phrase prefix not matching all terms
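I'm having a problem where a match_phrase_prefix query in Elasticsearch does not return all the results I expect, particularly when the query is one word followed by a single letter.
Take this index mapping (a contrived example, to protect sensitive data):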
http://localhost:9200/test/drinks/_mapping
returns:
{
"test": {
"mappings": {
"drinks": {
"properties": {
"name": {
"type": "text"
}
}
}
}
}
}
Among millions of other records, the index contains these:
{
"_index": "test",
"_type": "drinks",
"_id": "2",
"_score": 1,
"_source": {
"name": "Johnnie Walker Black Label"
}
},
{
"_index": "test",
"_type": "drinks",
"_id": "1",
"_score": 1,
"_source": {
"name": "Johnnie Walker Blue Label"
}
}
The following query, one word followed by two letters:
POST http://localhost:9200/test/drinks/_search
{
"query": {
"match_phrase_prefix" : {
"name" : "Walker Bl"
}
}
}
returns this:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.5753642,
"hits": [
{
"_index": "test",
"_type": "drinks",
"_id": "2",
"_score": 0.5753642,
"_source": {
"name": "Johnnie Walker Black Label"
}
},
{
"_index": "test",
"_type": "drinks",
"_id": "1",
"_score": 0.5753642,
"_source": {
"name": "Johnnie Walker Blue Label"
}
}
]
}
}
Whereas this query, one word followed by one letter:
POST http://localhost:9200/test/drinks/_search
{
"query": {
"match_phrase_prefix" : {
"name" : "Walker B"
}
}
}
returns no results. What is going on here?
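I'm assuming you are using Elasticsearch 5.0 or later.
I think this is probably caused by the default value of max_expansions.
As described in the documentation, the max_expansions parameter controls how many terms the last term's prefix will be expanded to. The default is 50, which would explain why you find "Black" and "Blue" when you type the first two letters, B and L, but not when you type only B: with millions of records, far more than 50 terms can start with "b", and "black" and "blue" may not be among the first 50.
The documentation is quite clear about this: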
The match_phrase_prefix query is a poor-man’s autocomplete. It is very easy to use, which let’s you get started quickly with search-as-you-type but it’s results, which usually are good enough, can sometimes be confusing.
Consider the query string quick brown f. This query works by creating a phrase query out of quick and brown (i.e. the term quick must exist and must be followed by the term brown). Then it looks at the sorted term dictionary to find the first 50 terms that begin with f, and adds these terms to the phrase query.
The problem is that the first 50 terms may not include the term fox so the phase quick brown fox will not be found. This usually isn’t a problem as the user will continue to type more letters until the word they are looking for appears
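The behavior described in the quoted documentation can be illustrated with a small simulation. This is only a sketch of the idea, not Elasticsearch's actual implementation: the function `expand_prefix` and the made-up terms `ba000` to `ba099` are hypothetical stand-ins for a real term dictionary that contains many terms sorting before "black".

```python
from bisect import bisect_left


def expand_prefix(sorted_terms, prefix, max_expansions=50):
    """Return the first `max_expansions` terms in sorted order that start with `prefix`."""
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    expansions = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(expansions) == max_expansions:
            break
        expansions.append(term)
    return expansions


# Hypothetical term dictionary: 100 invented terms that sort before "black",
# plus the two terms from the question (lowercased, as a text field would index them).
terms = sorted([f"ba{i:03d}" for i in range(100)] + ["black", "blue"])

print("black" in expand_prefix(terms, "b"))                       # False
print(expand_prefix(terms, "bl"))                                 # ['black', 'blue']
print("black" in expand_prefix(terms, "b", max_expansions=1000))  # True
```

With the prefix "b", the 50 slots are used up by terms that sort earlier, so "black" and "blue" never make it into the phrase query. The longer prefix "bl" skips those terms entirely.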
I can't tell you whether it is safe to raise this parameter above 50 if you care about performance, because I have never tried it myself.
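If you want to experiment with a higher limit, max_expansions can be set per query by using the object form of match_phrase_prefix. A minimal sketch that builds such a request body follows; the helper name `phrase_prefix_query` and the value 1000 are illustrative assumptions, not recommendations, and the performance impact on your index would need to be measured.

```python
import json


def phrase_prefix_query(field, text, max_expansions=50):
    """Build a match_phrase_prefix search body with an explicit max_expansions."""
    return {
        "query": {
            "match_phrase_prefix": {
                field: {
                    "query": text,
                    "max_expansions": max_expansions,
                }
            }
        }
    }


# 1000 is an arbitrary value chosen for illustration only.
body = phrase_prefix_query("name", "Walker B", max_expansions=1000)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the same POST http://localhost:9200/test/drinks/_search request used in the question.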