How does the Elasticsearch synonym token filter work if the synonym is multi-word?
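Can someone explain how the synonym token filter works when a synonym is a multi-word expression and the tokenizer is whitespace? For example, suppose I have this simple mapping: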
PUT /test_index
{
  "settings": {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "synonym" : {
            "tokenizer" : "whitespace",
            "filter" : ["synonym"]
          }
        },
        "filter" : {
          "synonym_graph" : {
            "type" : "synonym",
            "lenient": true,
            "synonyms" : ["multi word, bar => baz"]
          }
        }
      }
    }
  }
}
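I don't understand how the filter can match the term "multi word" if the whitespace tokenizer splits it into two tokens, "multi" and "word". As I understand it, the synonym filter would never receive "multi word" as a single term to look up in the synonym configuration. Any help is appreciated.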
The answer can be found in this section of the documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/7.6/token-graphs.html
and in this blog post:
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
Some token filters can add tokens that span multiple positions. These can include tokens for multi-word synonyms, such as using "atm" as a synonym for "automatic teller machine". However, only some token filters, known as graph token filters, accurately record the positionLength for multi-position tokens.
Indexing ignores the positionLength attribute and does not support token graphs containing multi-position tokens. However, queries, such as the match or match_phrase query, can use these graphs to generate multiple sub-queries from a single query string.
The following token filters can add tokens that span multiple positions but only record a default positionLength of 1:
- synonym
- word_delimiter
This means these filters will produce invalid token graphs for streams containing such tokens.
Avoid using invalid token graphs for search. Invalid graphs can cause unexpected search results.
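To see this for yourself, the _analyze API can show what the filter emits for a multi-word input. The request below is a minimal sketch, not taken from the question: it defines the tokenizer and the synonym filter inline, so it does not depend on test_index existing. As far as I understand it, the synonym rules are themselves run through the tokenizer, so "multi word" is stored as the two-token sequence multi, word, and the filter looks ahead in the token stream to match that sequence instead of expecting a single "multi word" term.

```console
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "synonym",
      "synonyms": ["multi word, bar => baz"]
    }
  ],
  "text": "multi word"
}
```

On a 7.x cluster I would expect the response to contain a single baz token in place of multi and word. Changing "type" to "synonym_graph" in the same request lets you compare the positionLength values described in the quoted documentation; add "explain": true to the request body if the attribute is not shown by default.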