Add leading/trailing space to elasticsearch tokenizer ngram
I am trying to generate ngram features with an Elasticsearch analyzer. In particular, I want the ngrams to include the leading/trailing space of each word. For example, if the text is "2 Quick Foxes", the ngram features with leading/trailing spaces would be:
" 2 ", "2 Q", ..., "Fox", "oxe", "xes", "es "
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}
POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes"
}
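To see why the setup above cannot produce space-containing grams: with `token_chars` set to `letter` and `digit`, the tokenizer first splits the text into runs of kept characters and only then emits ngrams within each run, so a space can never appear inside a gram. A rough Python model of that behavior (not the real tokenizer, just a sketch of the splitting logic):

```python
import re

def ngram_tokens(text, min_gram=3, max_gram=3, token_chars=r"[A-Za-z0-9]"):
    """Rough model of Elasticsearch's ngram tokenizer: split the text into
    runs of kept characters (token_chars), then emit every n-gram per run."""
    tokens = []
    for run in re.findall(token_chars + "+", text):
        for n in range(min_gram, max_gram + 1):
            tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens

print(ngram_tokens("2 Quick Foxes"))
# → ['Qui', 'uic', 'ick', 'Fox', 'oxe', 'xes']
```

The lone "2" is shorter than `min_gram`, so it disappears entirely, and no gram spans a word boundary.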
You can add two pattern replace character filters -- one for the leading space and one for the trailing space:
PUT my-index-000001
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "my_tokenizer",
            "char_filter": [
              "leading_space",
              "trailing_space"
            ]
          }
        },
        "tokenizer": {
          "my_tokenizer": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 3,
            "token_chars": [
              "letter",
              "digit",
              "whitespace"
            ]
          }
        },
        "char_filter": {
          "leading_space": {
            "type": "pattern_replace",
            "pattern": "(^.)",
            "replacement": " $1"
          },
          "trailing_space": {
            "type": "pattern_replace",
            "pattern": "(.$)",
            "replacement": "$1 "
          }
        }
      }
    }
  }
}
Note the addition of whitespace to the token_chars of my_tokenizer -- without it, the above will not work. Also note that the replacements reference the captured group ($1) so that the first and last characters are kept and a space is added next to them, rather than replacing them.
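The effect of the whole pipeline can be sketched in Python, using `re.sub` as a stand-in for the two `pattern_replace` char filters (the replacements here keep the captured character, `$1` in Elasticsearch terms, so only a space is added) and the same run-splitting model for the ngram tokenizer:

```python
import re

def analyze_with_spaces(text, min_gram=3, max_gram=3):
    # Stand-ins for the leading_space / trailing_space char filters:
    # keep the matched character and add a space next to it.
    text = re.sub(r"(^.)", r" \1", text)   # leading_space
    text = re.sub(r"(.$)", r"\1 ", text)   # trailing_space
    # token_chars = letter, digit, whitespace: spaces stay inside the runs
    tokens = []
    for run in re.findall(r"[A-Za-z0-9 ]+", text):
        for n in range(min_gram, max_gram + 1):
            tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens

tokens = analyze_with_spaces("2 Quick Foxes")
print(tokens[:2], tokens[-2:])
# → [' 2 ', '2 Q'] ['xes', 'es ']
```

Because whitespace is now a kept character class, " 2 Quick Foxes " becomes a single run, and the grams at both ends carry the added spaces.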