如何在弹性搜索索引中一起使用 ngram 和边缘 ngram 分词器?
How to use an ngram and edge ngram tokenizer together in elasticsearch index?
我有一个包含 3 个文档的索引。
{
"firstname": "Anne",
"lastname": "Borg",
}
{
"firstname": "Leanne",
"lastname": "Ray"
},
{
"firstname": "Anne",
"middlename": "M",
"lastname": "Stone"
}
当我搜索 "Anne" 时,我想弹性地 return 所有这 3 个文档(因为它们都在一定程度上匹配术语 "Anne" ).但是,我希望 Leanne Ray 的分数(相关性排名)较低,因为搜索词 "Anne" 在本文档中出现的位置比其他两个文档中出现的词要晚。
最初,我使用的是 ngram 分词器。我的索引映射中还有一个名为 "full_name" 的生成字段,其中包含名字、中间名和姓氏字符串。当我搜索 "Anne" 时,所有 3 个文档都在结果集中。然而,Anne M Stone 的得分与 Leanne Ray 相同。 Anne M Stone 的得分应该高于 Leanne。
为了解决这个问题,我将我的 ngram 分词器更改为 edge_ngram 分词器。这具有从结果集中完全排除 Leanne Ray 的效果。我们希望将此结果保留在结果集中 - 因为它仍然包含查询字符串 - 但得分低于其他两个更好的匹配项。
我在某处读到,可以在同一索引中将边缘 ngram 过滤器与 ngram 过滤器一起使用。如果是这样,我应该如何重新创建索引呢?有更好的解决方案吗?
这是初始索引设置。
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"token_chars": [
"letter",
"digit",
"custom"
],
"custom_token_chars": "'-",
"min_gram": "3",
"type": "ngram",
"max_gram": "4"
}
}
}
},
"mappings": {
"properties": {
"contact_id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"firstname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"copy_to": [
"full_name"
]
},
"lastname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"copy_to": [
"full_name"
]
},
"middlename": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"copy_to": [
"full_name"
]
},
"full_name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
这是我的查询
{
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "Anne",
"fields": [
"full_name"
]
}
}
]
}
}
}
这带回了这些结果
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 0.59604377,
"hits": [
{
"_index": "contacts_15",
"_type": "_doc",
"_id": "3",
"_score": 0.59604377,
"_source": {
"firstname": "Anne",
"lastname": "Borg"
}
},
{
"_index": "contacts_15",
"_type": "_doc",
"_id": "1",
"_score": 0.5592099,
"_source": {
"firstname": "Anne",
"middlename": "M",
"lastname": "Stone"
}
},
{
"_index": "contacts_15",
"_type": "_doc",
"_id": "2",
"_score": 0.5592099,
"_source": {
"firstname": "Leanne",
"lastname": "Ray"
}
}
]
}
如果我改为使用边缘 ngram 分词器,这就是索引设置的样子...
{
"settings": {
"max_ngram_diff": "10",
"analysis": {
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": ["edge_ngram_tokenizer"]
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"token_chars": [
"letter",
"digit",
"custom"
],
"custom_token_chars": "'-",
"min_gram": "2",
"type": "edge_ngram",
"max_gram": "10"
}
}
}
},
"mappings": {
"properties": {
"contact_id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"firstname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"copy_to": [
"full_name"
]
},
"lastname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"copy_to": [
"full_name"
]
},
"middlename": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"copy_to": [
"full_name"
]
},
"full_name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
同一个查询会返回这个新的结果集...
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.5131824,
"hits": [
{
"_index": "contacts_16",
"_type": "_doc",
"_id": "3",
"_score": 1.5131824,
"_source": {
"firstname": "Anne",
"middlename": "M",
"lastname": "Stone"
}
},
{
"_index": "contacts_16",
"_type": "_doc",
"_id": "1",
"_score": 1.4100108,
"_source": {
"firstname": "Anne",
"lastname": "Borg"
}
}
]
}
您可以继续使用 ngram(即第一个解决方案),但随后您需要更改查询以提高相关性。它的工作方式是在 should
子句中添加一个增强的 multi_match
查询,以增加名字或姓氏与输入完全匹配的文档的分数:
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "Anne",
"fields": [
"full_name"
]
}
}
],
"should": [
{
"multi_match": {
"query": "Anne",
"fields": [
"firstname",
"lastname"
],
"boost": 10
}
}
]
}
}
}
此查询将在 Leanne Ray
之前带来 Anne Borg
和 Anne M Stone
。
更新
这是我得出结果的方式。
首先,我创建了一个 与您添加到问题中的 settings/mappings 完全相同的测试索引:
PUT test
{ ... copy/pasted mappings/settings ... }
然后我添加了您提供的三个示例文档:
POST test/_doc/_bulk
{"index":{}}
{"firstname":"Anne","lastname":"Borg"}
{"index":{}}
{"firstname":"Leanne","lastname":"Ray"}
{"index":{}}
{"firstname":"Anne","middlename":"M","lastname":"Stone"}
最后,如果你 运行 我上面的查询,你会得到以下结果,这正是你所期望的(看分数):
{
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 5.1328206,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "4ZqbDHIBhYuDqANwQ-ih",
"_score" : 5.1328206,
"_source" : {
"firstname" : "Anne",
"lastname" : "Borg"
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "45qbDHIBhYuDqANwQ-ih",
"_score" : 5.0862665,
"_source" : {
"firstname" : "Anne",
"middlename" : "M",
"lastname" : "Stone"
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "4pqbDHIBhYuDqANwQ-ih",
"_score" : 0.38623023,
"_source" : {
"firstname" : "Leanne",
"lastname" : "Ray"
}
}
]
}
}
我有一个包含 3 个文档的索引。
{
"firstname": "Anne",
"lastname": "Borg",
}
{
"firstname": "Leanne",
"lastname": "Ray"
},
{
"firstname": "Anne",
"middlename": "M",
"lastname": "Stone"
}
当我搜索 "Anne" 时,我想弹性地 return 所有这 3 个文档(因为它们都在一定程度上匹配术语 "Anne" ).但是,我希望 Leanne Ray 的分数(相关性排名)较低,因为搜索词 "Anne" 在本文档中出现的位置比其他两个文档中出现的词要晚。
最初,我使用的是 ngram 分词器。我的索引映射中还有一个名为 "full_name" 的生成字段,其中包含名字、中间名和姓氏字符串。当我搜索 "Anne" 时,所有 3 个文档都在结果集中。然而,Anne M Stone 的得分与 Leanne Ray 相同。 Anne M Stone 的得分应该高于 Leanne。
为了解决这个问题,我将我的 ngram 分词器更改为 edge_ngram 分词器。这具有从结果集中完全排除 Leanne Ray 的效果。我们希望将此结果保留在结果集中 - 因为它仍然包含查询字符串 - 但得分低于其他两个更好的匹配项。
我在某处读到,可以在同一索引中将边缘 ngram 过滤器与 ngram 过滤器一起使用。如果是这样,我应该如何重新创建索引呢?有更好的解决方案吗?
这是初始索引设置。
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"token_chars": [
"letter",
"digit",
"custom"
],
"custom_token_chars": "'-",
"min_gram": "3",
"type": "ngram",
"max_gram": "4"
}
}
}
},
"mappings": {
"properties": {
"contact_id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"firstname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"copy_to": [
"full_name"
]
},
"lastname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"copy_to": [
"full_name"
]
},
"middlename": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"copy_to": [
"full_name"
]
},
"full_name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
这是我的查询
{
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "Anne",
"fields": [
"full_name"
]
}
}
]
}
}
}
这带回了这些结果
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 0.59604377,
"hits": [
{
"_index": "contacts_15",
"_type": "_doc",
"_id": "3",
"_score": 0.59604377,
"_source": {
"firstname": "Anne",
"lastname": "Borg"
}
},
{
"_index": "contacts_15",
"_type": "_doc",
"_id": "1",
"_score": 0.5592099,
"_source": {
"firstname": "Anne",
"middlename": "M",
"lastname": "Stone"
}
},
{
"_index": "contacts_15",
"_type": "_doc",
"_id": "2",
"_score": 0.5592099,
"_source": {
"firstname": "Leanne",
"lastname": "Ray"
}
}
]
}
如果我改为使用边缘 ngram 分词器,这就是索引设置的样子...
{
"settings": {
"max_ngram_diff": "10",
"analysis": {
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": ["edge_ngram_tokenizer"]
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"token_chars": [
"letter",
"digit",
"custom"
],
"custom_token_chars": "'-",
"min_gram": "2",
"type": "edge_ngram",
"max_gram": "10"
}
}
}
},
"mappings": {
"properties": {
"contact_id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"firstname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"copy_to": [
"full_name"
]
},
"lastname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"copy_to": [
"full_name"
]
},
"middlename": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"copy_to": [
"full_name"
]
},
"full_name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
同一个查询会返回这个新的结果集...
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.5131824,
"hits": [
{
"_index": "contacts_16",
"_type": "_doc",
"_id": "3",
"_score": 1.5131824,
"_source": {
"firstname": "Anne",
"middlename": "M",
"lastname": "Stone"
}
},
{
"_index": "contacts_16",
"_type": "_doc",
"_id": "1",
"_score": 1.4100108,
"_source": {
"firstname": "Anne",
"lastname": "Borg"
}
}
]
}
您可以继续使用 ngram(即第一个解决方案),但随后您需要更改查询以提高相关性。它的工作方式是在 should
子句中添加一个增强的 multi_match
查询,以增加名字或姓氏与输入完全匹配的文档的分数:
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "Anne",
"fields": [
"full_name"
]
}
}
],
"should": [
{
"multi_match": {
"query": "Anne",
"fields": [
"firstname",
"lastname"
],
"boost": 10
}
}
]
}
}
}
此查询将在 Leanne Ray
之前带来 Anne Borg
和 Anne M Stone
。
更新
这是我得出结果的方式。
首先,我创建了一个 与您添加到问题中的 settings/mappings 完全相同的测试索引:
PUT test
{ ... copy/pasted mappings/settings ... }
然后我添加了您提供的三个示例文档:
POST test/_doc/_bulk
{"index":{}}
{"firstname":"Anne","lastname":"Borg"}
{"index":{}}
{"firstname":"Leanne","lastname":"Ray"}
{"index":{}}
{"firstname":"Anne","middlename":"M","lastname":"Stone"}
最后,如果你 运行 我上面的查询,你会得到以下结果,这正是你所期望的(看分数):
{
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 5.1328206,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "4ZqbDHIBhYuDqANwQ-ih",
"_score" : 5.1328206,
"_source" : {
"firstname" : "Anne",
"lastname" : "Borg"
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "45qbDHIBhYuDqANwQ-ih",
"_score" : 5.0862665,
"_source" : {
"firstname" : "Anne",
"middlename" : "M",
"lastname" : "Stone"
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "4pqbDHIBhYuDqANwQ-ih",
"_score" : 0.38623023,
"_source" : {
"firstname" : "Leanne",
"lastname" : "Ray"
}
}
]
}
}