How do I get Elasticsearch to highlight a partial word from a search_as_you_type field?

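I'm having trouble setting up a search_as_you_type field with highlighting while following the guide here: https://www.elastic.co/guide/en/elasticsearch/reference/7.x/search-as-you-type.html

I'll leave a series of commands to reproduce what I'm seeing. Hopefully someone can weigh in on what I'm missing :)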

  1. Create the mapping
PUT /test_index
{
  "mappings": {
    "properties": {
      "plain_text": {
        "type": "search_as_you_type",
        "index_options": "offsets",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
  2. Index a document
POST /test_index/_doc
{
  "plain_text": "This is some random text"
}
  3. Search for the document
GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "rand",
      "type": "bool_prefix",
      "fields": [
        "plain_text",
        "plain_text._2gram",
        "plain_text._3gram",
        "plain_text._index_prefix"
      ]
    }
  },
  "highlight" : {
    "fields" : [
      {
        "plain_text": {
          "number_of_fragments": 1,
          "no_match_size": 100
        } 
      }
    ]
  }
}
  4. Response
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "rLZkjm8BDC17cLikXRbY",
        "_score" : 1.0,
        "_source" : {
          "plain_text" : "This is some random text"
        },
        "highlight" : {
          "plain_text" : [
            "This is some random text"
          ]
        }
      }
    ]
  }
}

The response I get back doesn't have the highlighting I expect. Ideally the highlight would be: This is some <em>ran</em>dom text

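To highlight n-grams (sequences of characters), you need:

  • A custom ngram tokenizer. By default the maximum allowed difference between min_gram and max_gram is 1, so in my example highlighting works only for search terms of length 3 or 4. You can change this and create more n-grams by setting a higher value for index.max_ngram_diff.
  • A custom analyzer based on the custom tokenizer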
  • A "plain_text.ngrams" sub-field added to the mapping, which is the field the highlight request targets
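
As a rough sketch of the index.max_ngram_diff point above, the settings fragment below (sent as part of the index creation body) raises the limit so the tokenizer can emit grams of 3 to 10 characters. The values 7 and 10 are illustrative assumptions, not something from the original setup. The setting only has to be at least max_gram minus min_gram.

```json
{
  "settings": {
    "index": {
      "max_ngram_diff": 7
    },
    "analysis": {
      "tokenizer": {
        "ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      }
    }
  }
}
```

A wider range means more terms per word in the index, so it is probably worth keeping max_gram close to the longest partial word you actually expect users to type.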

Here is the configuration:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "partial_words" : {
          "type": "custom",
          "tokenizer": "ngrams",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "plain_text": {
        "type": "text",
        "fields": {
          "shingles": { 
            "type": "search_as_you_type"
          },
          "ngrams": {
            "type": "text",
            "analyzer": "partial_words",
            "search_analyzer": "standard",
            "term_vector": "with_positions_offsets"
          }
        }
      }
    }
  }
}
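
To check what the partial_words analyzer actually produces, one option is the _analyze API. The body below is a sketch meant to be sent to GET /test_index/_analyze, assuming the index was created under the name test_index as in the question. The sample text "random" is only an example.

```json
{
  "analyzer": "partial_words",
  "text": "random"
}
```

Each token in the response comes with a start_offset and end_offset. With min_gram 3 and max_gram 4 the list should include "ran" and "rand", which is why a query for "rand" can be highlighted while a longer partial word such as "rando" has no matching indexed term.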

Query:

{
  "query": {
    "multi_match": {
      "query": "rand",
      "type": "bool_prefix",
      "fields": [
        "plain_text.shingles",
        "plain_text.shingles._2gram",
        "plain_text.shingles._3gram",
        "plain_text.shingles._index_prefix",
        "plain_text.ngrams"
      ]
    }
  },
  "highlight" : {
    "fields" : [
      {
        "plain_text.ngrams": { } 
      }
    ]
  }
}

Result:

    "hits": [
        {
            "_index": "test_index",
            "_type": "_doc",
            "_id": "FkHLVHABd_SGa-E-2FKI",
            "_score": 2,
            "_source": {
                "plain_text": "This is some random text"
            },
            "highlight": {
                "plain_text.ngrams": [
                    "This is some <em>rand</em>om text"
                ]
            }
        }
    ]

Note: in some cases this configuration can be expensive in terms of memory usage and storage.