Elasticsearch synonym filter behaves differently when applied in the analyzer than when applied through a multiplexer
The synonym filter, my_synonym, works as expected when I apply it directly in the analyzer:
PUT /test_index?pretty
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "my_synonym"
            ]
          }
        },
        "filter": {
          "my_synonym": {
            "type": "synonym",
            "synonyms": [
              "foo, bar => baz",
              "The hero => CaptainAmerica"
            ]
          },
          "my_multiplexer": {
            "type": "multiplexer",
            "filters": [
              "my_synonym"
            ]
          }
        }
      }
    }
  }
}
When I run:
GET /test_index/_analyze?pretty
{
  "analyzer": "my_analyzer",
  "text": "The hero bar"
}
I get the output below, which is what I expect:
{
  "tokens" : [
    {
      "token" : "CaptainAmerica",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "baz",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}
But when I put my_synonym inside my_multiplexer and use my_multiplexer as the analyzer's filter, the result is different:
PUT /test_index?pretty
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "my_multiplexer"
            ]
          }
        },
        "filter": {
          "my_synonym": {
            "type": "synonym",
            "synonyms": [
              "foo, bar => baz",
              "The hero => CaptainAmerica"
            ]
          },
          "my_multiplexer": {
            "type": "multiplexer",
            "filters": [
              "my_synonym"
            ]
          }
        }
      }
    }
  }
}
The result of the same query is:
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "hero",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "bar",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "baz",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 2
    }
  ]
}
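I noticed that if I use commas (,) instead of spaces in the synonym rules, it works as expected, but I need to join several words and link them to a single entity.
Please tell me what is going wrong, or suggest a workaround.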
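Found the problem.
Apparently I am hitting a side effect of the "remove duplicates" filter that the multiplexer filter applies to its output. In the code of the "remove duplicates" filter we can see: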
boolean duplicate = (posIncrement == 0 && previous.contains(term, 0, length));
Workaround:
In the test in RemoveDuplicatesTokenFilter, replace the String method "contains" with "equals".
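To check where the tokens change, it may help to re-run the same request with the _analyze API's explain option, which reports the token stream after the tokenizer and after each token filter in the analyzer chain. This is only a diagnostic sketch that reuses the index and analyzer names from the question. As far as I know it lists my_multiplexer as a single stage, so it will not show steps internal to the multiplexer.

```
GET /test_index/_analyze?pretty
{
  "analyzer": "my_analyzer",
  "text": "The hero bar",
  "explain": true
}
```

Comparing the detailed output for the two index definitions should show whether "The" and "hero" already survive the multiplexer stage unmerged.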