How do I get Elasticsearch to highlight a partial word from a search_as_you_type field?
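I'm having trouble setting up a highlighted search_as_you_type field while following the guide here: https://www.elastic.co/guide/en/elasticsearch/reference/7.x/search-as-you-type.html
I'll leave a series of commands to reproduce what I'm seeing. Hopefully someone can weigh in on what I'm missing :)
- Create the mapping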
PUT /test_index
{
  "mappings": {
    "properties": {
      "plain_text": {
        "type": "search_as_you_type",
        "index_options": "offsets",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
- Insert a document
POST /test_index/_doc
{
  "plain_text": "This is some random text"
}
- Search for the document
GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "rand",
      "type": "bool_prefix",
      "fields": [
        "plain_text",
        "plain_text._2gram",
        "plain_text._3gram",
        "plain_text._index_prefix"
      ]
    }
  },
  "highlight": {
    "fields": [
      {
        "plain_text": {
          "number_of_fragments": 1,
          "no_match_size": 100
        }
      }
    ]
  }
}
- Response
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "rLZkjm8BDC17cLikXRbY",
        "_score" : 1.0,
        "_source" : {
          "plain_text" : "This is some random text"
        },
        "highlight" : {
          "plain_text" : [
            "This is some random text"
          ]
        }
      }
    ]
  }
}
The response I get back doesn't contain the highlighting I expect.
Ideally the highlight would be: This is some <em>ran</em>dom text
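To highlight character n-grams, you need:
- A custom ngram tokenizer. By default, the maximum allowed difference between min_gram and max_gram is 1, so in my example highlighting only works for search terms of length 3 or 4. You can change this and generate more n-grams by setting index.max_ngram_diff to a higher value.
- A custom analyzer based on the custom tokenizer
- An additional "plain_text.ngrams" field in the mapping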
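To make the length restriction concrete, here is a rough Python model of what an ngram tokenizer with min_gram 3 and max_gram 4 followed by a lowercase filter would index. This is only a sketch, not Elasticsearch code: the function name char_ngrams is made up for illustration, it assumes the tokenizer's default of keeping every character (whitespace included), and it only checks exact term matches, ignoring the prefix handling that bool_prefix applies to the last search term.

```python
def char_ngrams(text, min_gram=3, max_gram=4):
    """Return the set of lowercase character n-grams of `text`.

    Simplified model: every substring whose length is between
    min_gram and max_gram (inclusive), whitespace included,
    lowercased to mimic the lowercase token filter.
    """
    lowered = text.lower()
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(lowered) - size + 1):
            grams.add(lowered[start:start + size])
    return grams


indexed = char_ngrams("This is some random text")

# A search term can only match exactly if it was indexed as a token.
for term in ["ra", "ran", "rand", "rando"]:
    print(term, term in indexed)
```

In this model only "ran" and "rand" are found among the indexed tokens; "ra" is shorter than min_gram and "rando" is longer than max_gram, which matches the 3-or-4-character limit described above.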
Here is the configuration:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "partial_words": {
          "type": "custom",
          "tokenizer": "ngrams",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "plain_text": {
        "type": "text",
        "fields": {
          "shingles": {
            "type": "search_as_you_type"
          },
          "ngrams": {
            "type": "text",
            "analyzer": "partial_words",
            "search_analyzer": "standard",
            "term_vector": "with_positions_offsets"
          }
        }
      }
    }
  }
}
Query:
{
  "query": {
    "multi_match": {
      "query": "rand",
      "type": "bool_prefix",
      "fields": [
        "plain_text.shingles",
        "plain_text.shingles._2gram",
        "plain_text.shingles._3gram",
        "plain_text.shingles._index_prefix",
        "plain_text.ngrams"
      ]
    }
  },
  "highlight": {
    "fields": [
      {
        "plain_text.ngrams": { }
      }
    ]
  }
}
Result:
"hits": [
  {
    "_index": "test_index",
    "_type": "_doc",
    "_id": "FkHLVHABd_SGa-E-2FKI",
    "_score": 2,
    "_source": {
      "plain_text": "This is some random text"
    },
    "highlight": {
      "plain_text.ngrams": [
        "This is some <em>rand</em>om text"
      ]
    }
  }
]
Note: in some cases this configuration can be expensive in terms of memory usage and storage.
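If a wider range of search-term lengths is needed, the index.max_ngram_diff change mentioned earlier could look roughly like the sketch below. It builds the settings part of the index creation body as a Python dict and prints it as JSON; the helper name build_index_settings and the 2 to 10 range are illustrative assumptions, not values from the answer, and a wider range makes the storage cost in the note above larger.

```python
import json


def build_index_settings(min_gram, max_gram):
    """Build the "settings" section of an index creation body.

    index.max_ngram_diff is set to exactly the gap between
    max_gram and min_gram, the smallest value that allows it.
    """
    if min_gram < 1 or max_gram < min_gram:
        raise ValueError("require 1 <= min_gram <= max_gram")
    return {
        "settings": {
            "index": {"max_ngram_diff": max_gram - min_gram},
            "analysis": {
                "analyzer": {
                    "partial_words": {
                        "type": "custom",
                        "tokenizer": "ngrams",
                        "filter": ["lowercase"],
                    }
                },
                "tokenizer": {
                    "ngrams": {
                        "type": "ngram",
                        "min_gram": min_gram,
                        "max_gram": max_gram,
                    }
                },
            },
        }
    }


# Illustrative range only: grams of 2 to 10 characters.
print(json.dumps(build_index_settings(2, 10), indent=2))
```

The printed object would be merged with the "mappings" section shown in the configuration above before being sent as the body of the index creation request.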