Elasticsearch custom analyzer with two output tokens
The requirement is to create a custom analyzer that generates two tokens, as in the scenario below.
For example:
Input -> B.tech in
Output Tokens ->
- btechin
- b.tech in
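I am able to remove the non-alphanumeric characters, but how can I also keep the original string in the output token list? Below is the custom analyzer I created.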
"alphanumericStringAnalyzer": {
  "filter": [
    "lowercase",
    "minLength_filter"
  ],
  "char_filter": [
    "specialCharactersFilter"
  ],
  "type": "custom",
  "tokenizer": "keyword"
}

"char_filter": {
  "specialCharactersFilter": {
    "pattern": "[^A-Za-z0-9]",
    "type": "pattern_replace",
    "replacement": ""
  }
},
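For the input "B.tech in", this analyzer generates a single token, "btechin", but I also want the original token "B.tech in" in the token list.
Thanks!
You can use the word delimiter token filter, as explained in this documentation.
Here is an example of a word delimiter configuration: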
POST _analyze
{
  "text": "B.tech in",
  "tokenizer": "keyword",
  "filter": [
    "lowercase",
    {
      "type": "word_delimiter",
      "catenate_all": true,
      "preserve_original": true,
      "generate_word_parts": false
    }
  ]
}
Result:
{
  "tokens": [
    {
      "token": "b.tech in",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "btechin",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    }
  ]
}
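To use the same filter chain at index time rather than only through an ad hoc _analyze call, the filter can be registered in the index settings and referenced from the custom analyzer. The following is a minimal sketch, not a definitive configuration: the index name "my-index" and the filter name "catenate_keep_original" are hypothetical, and "minLength_filter" from the question is left out because its definition was not shown. Note that "specialCharactersFilter" has to be dropped from the analyzer, since a char_filter runs before tokenization and would strip the punctuation before the token filter could preserve the original.

```
# Sketch only: "my-index" and "catenate_keep_original" are placeholder names.
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "catenate_keep_original": {
          "type": "word_delimiter",
          "catenate_all": true,
          "preserve_original": true,
          "generate_word_parts": false
        }
      },
      "analyzer": {
        "alphanumericStringAnalyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "catenate_keep_original"
          ]
        }
      }
    }
  }
}
```

Once the index exists, the analyzer can be checked by name with a request such as GET /my-index/_analyze with the body { "analyzer": "alphanumericStringAnalyzer", "text": "B.tech in" }. If the input may contain digits, "generate_number_parts" probably needs to be set to false as well, since it defaults to true and would emit the numeric parts as separate tokens. The Elasticsearch documentation also recommends "word_delimiter_graph" over "word_delimiter" for new configurations; it accepts the same options used here.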
Hope this meets your requirement!