Synonyms relevance issue in Elasticsearch
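I am trying to configure synonyms in Elasticsearch and have set up a sample configuration, but I am not getting the expected relevance when I search the data.
Below is the index mapping configuration: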
PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [
              "mind, brain",
              "brainstorm,brain storm"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase"
            ]
          },
          "my_search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_synonyms"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Below is the sample data I have indexed:
POST test_index/_bulk
{ "index" : { "_id" : "1" } }
{"my_field": "This is a brainstorm" }
{ "index" : { "_id" : "2" } }
{"my_field": "A different brain storm" }
{ "index" : { "_id" : "3" } }
{"my_field": "About brainstorming" }
{ "index" : { "_id" : "4" } }
{"my_field": "I had a storm in my brain" }
{ "index" : { "_id" : "5" } }
{"my_field": "I envisaged something like that" }
Below is the query I am trying:
GET test_index/_search
{
  "query": {
    "match": {
      "my_field": {
        "query": "brainstorm",
        "analyzer": "my_search_analyzer"
      }
    }
  }
}
Current result:
"hits" : [
  {
    "_index" : "test_index",
    "_type" : "_doc",
    "_id" : "2",
    "_score" : 1.8185701,
    "_source" : {
      "my_field" : "A different brain storm"
    }
  },
  {
    "_index" : "test_index",
    "_type" : "_doc",
    "_id" : "4",
    "_score" : 1.4100728,
    "_source" : {
      "my_field" : "I had a storm in my brain"
    }
  },
  {
    "_index" : "test_index",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.90928507,
    "_source" : {
      "my_field" : "This is a brainstorm"
    }
  }
]
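I expect documents that match exactly to be at the top, and documents that match only through synonyms to score lower.
So my expectation is that the document with the value "This is a brainstorm" should be ranked first.
Could you suggest how I can achieve this?
I have also tried applying boosting and weighting, but with no success.
Thanks in advance!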
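Elasticsearch "replaces" every instance of a synonym with all the other synonyms, and it does so at both index time and search time (unless you provide a separate search_analyzer), so you lose the exact token. To keep this information, use a subfield with the standard analyzer, then use a multi_match query to match either the synonyms or the exact value, and boost the exact field.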
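The subfield approach described above might look roughly like the sketch below. It is only an illustration under stated assumptions: the index name test_index_v2, the subfield name exact, the ^3 boost and the most_fields type are all hypothetical choices, not something given in the answer. The synonym analyzer is set as search_analyzer in the mapping so that each field is analyzed with its own analyzer at query time.

```
PUT /test_index_v2
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [
              "mind, brain",
              "brainstorm,brain storm"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase"]
          },
          "my_search_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "my_synonyms"]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "my_search_analyzer",
        "fields": {
          "exact": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}

GET /test_index_v2/_search
{
  "query": {
    "multi_match": {
      "query": "brainstorm",
      "type": "most_fields",
      "fields": ["my_field", "my_field.exact^3"]
    }
  }
}
```

The documents would have to be indexed into this new index (for example by re-running the _bulk request against it) before searching. The boost value is a starting point to tune, not a recommended setting.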
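I got the answer from the Elastic forum here. I have copied it below for quick reference.
Hello,
Since you are indexing synonyms into your inverted index, "brain storm" and "brainstorm" are all different tokens after the analyzer does its job. At query time, Elasticsearch uses your analyzer to create the tokens brain, storm and brainstorm from your query, and several of those tokens match documents 2 and 4. Document 2 has fewer words, so tf/idf scores it higher of the two, while document 1 only matches brainstorm.
You can also see what your analyzer does to your input with this: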
POST test_index/_analyze
{
  "analyzer": "my_search_analyzer",
  "text": "I had a storm in my brain"
}
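The same _analyze endpoint can also be pointed at the query text itself, which is where the synonym expansion described above takes place. A minimal sketch, reusing the index and analyzer names from the question:

```
POST test_index/_analyze
{
  "analyzer": "my_search_analyzer",
  "text": "brainstorm"
}
```

The response lists the tokens the match query will actually search for, so it is a quick way to confirm whether brain and storm are being added alongside brainstorm.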
I did some experimenting, and you should change your index analyzer to my_analyzer:
PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [
              "mind, brain",
              "brainstorm,brain storm"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase"
            ]
          },
          "my_search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_synonyms"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Then, you want to boost your exact match, but you also want to get hits from the my_search_analyzer tokens, so I changed your query a little:
GET test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "brainstorm",
              "analyzer": "my_search_analyzer"
            }
          }
        },
        {
          "match_phrase": {
            "my_field": {
              "query": "brainstorm"
            }
          }
        }
      ]
    }
  }
}
Result:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 2.3491273,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.3491273,
        "_source" : {
          "my_field" : "This is a brainstorm"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.8185701,
        "_source" : {
          "my_field" : "A different brain storm"
        }
      }
    ]
  }
}
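If the gap between the exact match and the synonym matches needs to be wider, one possible variation is to put an explicit boost on the match_phrase clause and turn on explain to see how each clause contributes to the score. This is a sketch only: the boost value of 2.0 is an arbitrary assumption to be tuned, and the rest is the bool query from the forum answer.

```
GET test_index/_search
{
  "explain": true,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "brainstorm",
              "analyzer": "my_search_analyzer"
            }
          }
        },
        {
          "match_phrase": {
            "my_field": {
              "query": "brainstorm",
              "boost": 2.0
            }
          }
        }
      ]
    }
  }
}
```

With explain enabled, each hit carries an _explanation section that breaks the score down per clause, which helps when deciding whether the boost is doing what is intended.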