Synonyms relevance issue in Elasticsearch

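I am trying to configure synonyms in Elasticsearch and have set up a sample configuration, but I am not getting the expected relevance when I search. Below is the index mapping configuration: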

PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [
              "mind, brain",
              "brainstorm,brain storm"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase"
            ]
          },
          "my_search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_synonyms"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

Below is the sample data I have indexed:

POST test_index/_bulk
{ "index" : { "_id" : "1" } }
{"my_field": "This is a brainstorm" }
{ "index" : { "_id" : "2" } }
{"my_field": "A different brain storm" }
{ "index" : { "_id" : "3" } }
{"my_field": "About brainstorming" }
{ "index" : { "_id" : "4" } }
{"my_field": "I had a storm in my brain" }
{ "index" : { "_id" : "5" } }
{"my_field": "I envisaged something like that" }

Below is the query I am trying:

GET test_index/_search
{
  "query": {
    "match": {
      "my_field": {
        "query": "brainstorm",
        "analyzer": "my_search_analyzer"
      }
    }
  }
}

Current result:

 "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.8185701,
        "_source" : {
          "my_field" : "A different brain storm"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.4100728,
        "_source" : {
          "my_field" : "I had a storm in my brain"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.90928507,
        "_source" : {
          "my_field" : "This is a brainstorm"
        }
      }
    ]

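I want documents with an exact match to be at the top, and documents that match only through synonyms to have a lower score. So I expect the document with the value "This is a brainstorm" to rank first.

Could you suggest how I can achieve this?

I have also tried applying boosts and weights, but without success.

Thanks in advance!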

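Elasticsearch "replaces" each instance of a synonym with all of its other synonyms, and it does so at both index time and search time (unless you provide a separate search_analyzer), so the exact token is lost. To keep that information, add a subfield that uses the standard analyzer, then use a multi_match query to match either the synonyms or the exact value, with a boost on the exact field.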
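
As a rough sketch of that suggestion, the Python below only builds the two request bodies (index creation and search) as dictionaries and prints them as JSON, so they can be pasted into the console. The subfield name `exact`, the boost value `3`, the `most_fields` type and the helper names `build_index_body` and `build_query` are assumptions made for illustration; they do not come from the question. It reuses the question's `my_analyzer` (no synonyms) for the subfield instead of the built-in standard analyzer, and applies synonyms on the main field at search time only.

```python
import json


def build_index_body():
    """Index settings and mapping with a synonym-free 'exact' subfield (hypothetical name)."""
    analysis = {
        "filter": {
            "my_synonyms": {
                "type": "synonym",
                "synonyms": ["mind, brain", "brainstorm,brain storm"],
            }
        },
        "analyzer": {
            "my_analyzer": {"tokenizer": "standard", "filter": ["lowercase"]},
            "my_search_analyzer": {
                "tokenizer": "standard",
                "filter": ["lowercase", "my_synonyms"],
            },
        },
    }
    my_field = {
        "type": "text",
        "analyzer": "my_analyzer",
        # Synonyms are applied to this field at search time only.
        "search_analyzer": "my_search_analyzer",
        # The subfield never sees synonyms, so it only matches the literal term.
        "fields": {"exact": {"type": "text", "analyzer": "my_analyzer"}},
    }
    return {
        "settings": {"index": {"analysis": analysis}},
        "mappings": {"properties": {"my_field": my_field}},
    }


def build_query(text, exact_boost=3):
    """multi_match over both fields; the boost value is an arbitrary assumption."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "most_fields",
                "fields": ["my_field", f"my_field.exact^{exact_boost}"],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_index_body(), indent=2))
    print(json.dumps(build_query("brainstorm"), indent=2))
```

No `analyzer` is set inside the query, so each field falls back to the search analyzer from its own mapping. The boost will probably need tuning against real data.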

I got an answer on the Elastic forum and have copied it below for quick reference.

Hello,

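Because you are indexing synonyms into the inverted index, "brain storm" and "brainstorm" are different tokens once the analyzer has done its work. At query time, Elasticsearch uses your analyzer to create the tokens brain, storm and brainstorm from your query. Documents 2 and 4 each match more than one of those tokens, and document 2 has fewer words, so TF/IDF scores it higher of the two; document 1 only matches brainstorm.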
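
To make the token matching concrete, here is a toy model in Python. It is not Elasticsearch's implementation: `analyze` is a simplified stand-in for the standard tokenizer plus lowercase filter, the `SYNONYMS` table hard-codes the expansion described above, and no scoring is done. It only reports which expanded query tokens each sample document contains.

```python
import re

# Sample documents from the question, keyed by _id.
DOCS = {
    "1": "This is a brainstorm",
    "2": "A different brain storm",
    "3": "About brainstorming",
    "4": "I had a storm in my brain",
    "5": "I envisaged something like that",
}

# Assumed search-time expansion of the query term (toy model only).
SYNONYMS = {"brainstorm": ["brainstorm", "brain", "storm"]}


def analyze(text):
    """Simplified stand-in for the standard tokenizer plus lowercase filter."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand(tokens):
    """Replace each token with its synonym list, or keep it unchanged."""
    expanded = []
    for token in tokens:
        expanded.extend(SYNONYMS.get(token, [token]))
    return expanded


def matched_terms(query):
    """Map each document id to the expanded query tokens it contains."""
    query_tokens = set(expand(analyze(query)))
    return {
        doc_id: sorted(query_tokens & set(analyze(text)))
        for doc_id, text in DOCS.items()
    }


if __name__ == "__main__":
    for doc_id, terms in matched_terms("brainstorm").items():
        print(doc_id, terms)
```

In this model, documents 2 and 4 each contain two of the expanded tokens (brain and storm), document 1 contains only brainstorm, and documents 3 and 5 contain none, which is consistent with the three hits in the question.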

You can also check what your analyzer does to your input with this:

POST test_index/_analyze
{
  "analyzer": "my_search_analyzer",
  "text": "I had a storm in my brain"
}

I did some experimenting; you should change the index analyzer to my_analyzer:

PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [
              "mind, brain",
              "brainstorm,brain storm"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase"
            ]
          },
          "my_search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_synonyms"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

Next, you want to boost exact matches but still get hits from the my_search_analyzer tokens, so I changed your query slightly:

GET test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "brainstorm",
              "analyzer": "my_search_analyzer"
            }
          }
        },
        {
          "match_phrase": {
            "my_field": {
              "query": "brainstorm"
            }
          }
        }
      ]
    }
  }
}

Result:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 2.3491273,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.3491273,
        "_source" : {
          "my_field" : "This is a brainstorm"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.8185701,
        "_source" : {
          "my_field" : "A different brain storm"
        }
      }
    ]
  }
}