Elasticsearch Edge NGram tokenizer: higher score when a word begins with the n-gram

Suppose there is the following mapping with an Edge NGram tokenizer:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": [
            "standard"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "whitespace"
        }
      },
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "symbol"
          ]
        }
      }
    }
  },
  "mappings": {
    "tag": {
      "properties": {
        "id": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "analyzer": "autocomplete_analyzer",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}

and the following documents have been indexed:

POST /tag/tag/_bulk
{"index":{}}
{"name" : "HITS FIND SOME"}
{"index":{}}
{"name" : "TRENDING HI"}
{"index":{}}
{"name" : "HITS OTHER"}

Then searching with

{
  "query": {
    "match": {
      "name": {
        "query": "HI"
      }
    }
  }
}

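yields the same score for all documents, or TRENDING HI with a score higher than one of the others.

How can this be configured so that entries that actually start with the searched n-gram get a higher score? In this case, HITS FIND SOME and HITS OTHER should score higher than TRENDING HI; at the same time, TRENDING HI should still be in the results.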

A highlighter is also used, so the proposed solution should not break it.

The highlighter used in the query is:

 "highlight": {
    "pre_tags": [
      "<"
    ],
    "post_tags": [
      ">"
    ],
    "fields": {
      "name": {}
    }
  }

Using it with match_phrase_prefix messes up the highlighting, yielding <H><I><T><S> FIND SOME when searching only for H.

In this particular case, you can add a match_phrase_prefix clause to your query, which performs a prefix match on the last term of the query text:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "HI"
          }
        },
        {
          "match_phrase_prefix": {
            "name": "HI"
          }
        }
      ]
    }
  }
}

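The match clause will match all three results, but match_phrase_prefix will not match TRENDING HI. So you will get all three items in the results, but TRENDING HI will appear with a lower score.

Quoting the docs: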

The match_phrase_prefix query is a poor-man’s autocomplete[...] For better solutions for search-as-you-type see the completion suggester and Index-Time Search-as-You-Type.

As a side note, if you are introducing a bool query, you may want to look at the minimum_should_match option, depending on the results you want.
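To make the side note concrete, here is a minimal sketch that builds the same bool request body as a Python dict and sets minimum_should_match explicitly. The helper name build_query is hypothetical and only used for illustration; minimum_should_match itself is a parameter of the bool query.

```python
import json


def build_query(text, minimum_should_match=1):
    """Build the bool query from the answer with an explicit minimum_should_match.

    build_query is a hypothetical helper, not part of any Elasticsearch client.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"name": text}},
                    {"match_phrase_prefix": {"name": text}},
                ],
                # Number of should clauses a document has to satisfy to be returned.
                "minimum_should_match": minimum_should_match,
            }
        }
    }


body = build_query("HI")
print(json.dumps(body, indent=2))
```

With a value of 1, a document only has to satisfy one of the two should clauses, so documents matched by the match clause alone stay in the results. Raising it to 2 would require both clauses to match and would drop those documents instead of merely ranking them lower.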

You need to understand how Elasticsearch/Lucene analyzes your data and computes the search score.

1. Analysis API

https://www.elastic.co/guide/en/elasticsearch/reference/current/_testing_analyzers.html This will show you what Elasticsearch will store. In your case:

T / TR / TRE /.... TRENDING / / H / HI
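The token list above can be reproduced offline with a rough approximation of the tokenizer. The sketch below is not Elasticsearch code: edge_ngrams is a hypothetical helper that splits on anything that is not an ASCII letter (the symbol class from token_chars is ignored for simplicity) and then emits the prefixes of each word between min_gram and max_gram.

```python
import re


def edge_ngrams(text, min_gram=1, max_gram=10):
    """Approximate an edge_ngram tokenizer that breaks on non-letters.

    Simplification: only ASCII letters count as token characters here.
    """
    tokens = []
    for word in re.findall(r"[A-Za-z]+", text):
        # Emit every prefix of the word, from min_gram up to max_gram characters.
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens


tokens = edge_ngrams("TRENDING HI")
print(tokens)
# ['T', 'TR', 'TRE', 'TREN', 'TREND', 'TRENDI', 'TRENDIN', 'TRENDING', 'H', 'HI']
```

Because the text is split into words first, the HI at the end of TRENDING HI produces exactly the same token as the HI prefix of HITS, which is why the plain match query cannot tell the two cases apart.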

2. Scoring

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html

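The bool query is commonly used to build the complex queries that a specific use case needs. Use must to filter documents, then should to score them. A common use case is to run different analyzers on the same field (using the fields keyword in the mapping, you can analyze the same field in different ways).

3. Don't mess up the highlighting

According to the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#specify-highlight-query

You can add an extra query that is used only for highlighting: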

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "HI"
          }
        }
      ],
      "should": [
        {
          "prefix": {
            "name": "HI"
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "<"
    ],
    "post_tags": [
      ">"
    ],
    "fields": {
      "name": {
        "highlight_query": {
          "match": {
            "name": "HI"
          }
        }
      }
    }
  }
}

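A possible solution for this problem is to use multi-fields. They allow the same data from your source document to be indexed in different ways. In your case you could index the name field as default text, then as n-grams and also as edge n-grams. The query would then have to be a bool query that compares against all of those different fields.

The final score of a document is composed of the match values of the individual clauses. These matches are also called signals, because each one signals that there is a match between the query and the document. The document with the most matching signals gets the highest score.

In your case, all documents will match the n-gram HI, but only the HITS FIND SOME and HITS OTHER documents will get the extra edge n-gram score. This boosts those two documents and puts them on top. The complication is that you have to make sure the edge n-gram tokenizer does not split on whitespace, because otherwise the HI at the end would get the same score as one at the beginning of the document.

Here is an example mapping and query for your case: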
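The reasoning above can be checked with a small simulation. This is a sketch under simplifying assumptions, not Elasticsearch code: the two helper functions are hypothetical stand-ins for an edge n-gram tokenizer that keeps whitespace (prefixes of the whole string) and an n-gram tokenizer that breaks on anything except letters and digits, both with min_gram 2 and max_gram 10. It only reports which sub-fields would contain the term HI and does not compute real scores.

```python
import re

DOCS = ["HITS FIND SOME", "TRENDING HI", "HITS OTHER"]


def whole_text_edge_ngrams(text, min_gram=2, max_gram=10):
    """Prefixes of the entire string: whitespace is kept, so only the start counts."""
    return {text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)}


def word_ngrams(text, min_gram=2, max_gram=10):
    """All substrings of each letter/digit run, between min_gram and max_gram long."""
    grams = set()
    for word in re.findall(r"[A-Za-z0-9]+", text):
        for start in range(len(word)):
            for n in range(min_gram, max_gram + 1):
                if start + n <= len(word):
                    grams.add(word[start:start + n])
    return grams


def matching_fields(doc, query):
    """List the simulated sub-fields whose tokens contain the query term."""
    fields = []
    if query in whole_text_edge_ngrams(doc):
        fields.append("name.edge")
    if query in word_ngrams(doc):
        fields.append("name.ngram")
    return fields


# The query HI is exactly min_gram characters long, so it stays a single term.
signals = {doc: matching_fields(doc, "HI") for doc in DOCS}
for doc, fields in signals.items():
    print(doc, "->", fields)
```

In this simulation every document matches on the n-gram sub-field, while only the two documents that start with HI also match on the edge sub-field, which is the extra signal the boost is attached to.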

PUT /tag/
{
    "settings": {
        "analysis": {
            "analyzer": {
                "edge_analyzer": {
                    "tokenizer": "edge_tokenizer"
                },
                "kw_analyzer": {
                    "tokenizer": "kw_tokenizer"
                },
                "ngram_analyzer": {
                    "tokenizer": "ngram_tokenizer"
                },
                "autocomplete_analyzer": {
                    "tokenizer": "autocomplete_tokenizer",
                    "filter": [
                        "standard"
                    ]
                },
                "autocomplete_search": {
                    "tokenizer": "whitespace"
                }
            },
            "tokenizer": {
                "kw_tokenizer": {
                    "type": "keyword"
                },
                "edge_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 10
                },
                "ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 10,
                    "token_chars": [
                        "letter",
                        "digit"
                    ]
                },
                "autocomplete_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": [
                        "letter",
                        "symbol"
                    ]
                }
            }
        }
    },
    "mappings": {
        "tag": {
            "properties": {
                "id": {
                    "type": "long"
                },
                "name": {
                    "type": "text",
                    "fields": {
                        "edge": {
                            "type": "text",
                            "analyzer": "edge_analyzer"
                        },
                        "ngram": {
                            "type": "text",
                            "analyzer": "ngram_analyzer"
                        }
                    }
                }
            }
        }
    }
}

And a query:

POST /tag/_search
{
    "query": {
        "bool": {
            "should": [
                {
                    "function_score": {
                        "query": {
                            "match": {
                                "name.edge": {
                                    "query": "HI"
                                }
                            }
                        },
                        "boost": "5",
                        "boost_mode": "multiply"
                    }
                },
                {
                    "match": {
                        "name.ngram": {
                            "query": "HI"
                        }
                    }
                },
                {
                    "match": {
                        "name": {
                            "query": "HI"
                        }
                    }
                }
            ]
        }
    }
}