Elasticsearch synonym filter behaves differently when applied at the analyzer level compared to the multiplexer level

The synonym filter, my_synonym, works as expected when I apply it at the analyzer level:

PUT /test_index?pretty
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "my_synonym"
            ]
          }
        },
        "filter": {
          "my_synonym": {
            "type": "synonym",
            "synonyms": [
              "foo, bar => baz",
              "The hero => CaptainAmerica"
            ]
          },
          "my_multiplexer": {
            "type": "multiplexer",
            "filters": [
              "my_synonym"
            ]
          }
        }
      }
    }
  }
}

When I run

GET /test_index/_analyze?pretty
{
  "analyzer": "my_analyzer",
  "text": "The hero bar"
}

I get the output below, which is what I expect:

{
  "tokens" : [
    {
      "token" : "CaptainAmerica",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "baz",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}

But when I apply the my_synonym filter inside my_multiplexer and put my_multiplexer in the analyzer's filter list, I get a different result:

PUT /test_index?pretty
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "my_multiplexer"
            ]
          }
        },
        "filter": {
          "my_synonym": {
            "type": "synonym",
            "synonyms": [
              "foo, bar => baz",
              "The hero => CaptainAmerica"
            ]
          },
          "my_multiplexer": {
            "type": "multiplexer",
            "filters": [
              "my_synonym"
            ]
          }
        }
      }
    }
  }
}

The result of the same _analyze request is:

{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "hero",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "bar",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "baz",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 2
    }
  ]
}

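I observed that if I use a comma (,) in place of the space (" ") in the synonym list, it works as expected, but I need to join several words and link them to a single entity.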
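
A sketch of the comma variant mentioned above, assuming only the second rule changes and the rest of the index settings stay as posted. Note that a comma on the left-hand side lists "The" and "hero" as separate alternatives that each map to CaptainAmerica, so this is not a phrase match:

          "my_synonym": {
            "type": "synonym",
            "synonyms": [
              "foo, bar => baz",
              "The, hero => CaptainAmerica"
            ]
          }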

Please tell me what is wrong, or suggest a workaround.

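Found the problem.

Apparently, I am running into a side effect of the "remove duplicates" filter that is applied to the output of the multiplexer filter. In the code of the "remove duplicates" filter we can see: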

boolean duplicate = (posIncrement == 0 && previous.contains(term, 0, length));

Workaround:

Replace the "contains" String method with "equals" in the duplicate test in "RemoveDuplicatesTokenFilter".