preserve_original: keeping the original token in Elasticsearch
I have a token filter and analyzer as shown below. However, I am unable to preserve the original token. For example, if I _analyze the word saint-louis, I get back only saintlouis, whereas I expect both saintlouis and saint-louis, since I have preserve_original set to true. The ES version I am using is 6.3.2 and the Lucene version is 7.3.1.
"analysis": {
"filter": {
"hyphenFilter": {
"pattern": "-",
"type": "pattern_replace",
"preserve_original": "true",
"replacement": ""
}
},
"analyzer": {
"whitespace_lowercase": {
"filter": [
"lowercase",
"asciifolding",
"hyphenFilter"
],
"type": "custom",
"tokenizer": "whitespace"
}
}
}
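For reference, the problem can be reproduced with the analyze API. Assuming the settings above belong to an index named test (a placeholder name), the request below returns only the single token saintlouis rather than both forms:

POST /test/_analyze
{
  "analyzer": "whitespace_lowercase",
  "text": "saint-louis"
}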
So it looks like preserve_original is not supported on the pattern_replace token filter, at least not in the version I am using.
I worked around it as follows:
Index definition
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "type": "custom",
          "filter": [
            "lowercase",
            "hyphen_filter"
          ]
        }
      },
      "filter": {
        "hyphen_filter": {
          "type": "word_delimiter",
          "preserve_original": "true",
          "catenate_words": "true"
        }
      }
    }
  }
}
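As a quick sanity check, the same filter chain can also be tried without creating an index at all, since the _analyze API accepts transient, inline filter definitions. A minimal sketch:

POST /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    {
      "type": "word_delimiter",
      "preserve_original": true,
      "catenate_words": true
    }
  ],
  "text": "anti-spam"
}

This produces the same tokens as the index-based analyzer defined above.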
For example, this will tokenize a word like anti-spam into antispam (hyphen removed), anti-spam (original preserved), anti, and spam.
Use the analyze API to inspect the generated tokens (note that the request must target the index where my_analyzer is defined; my_index below is a placeholder name):
POST /my_index/_analyze
{
  "text": "anti-spam",
  "analyzer": "my_analyzer"
}
Output of the analyze API, i.e. the generated tokens:
{
  "tokens": [
    {
      "token": "anti-spam",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "anti",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "antispam",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "spam",
      "start_offset": 5,
      "end_offset": 9,
      "type": "word",
      "position": 1
    }
  ]
}
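To put the workaround to use at search time, the analyzer would typically be assigned to a text field in the mapping. A minimal sketch, assuming an index named my_index, a field named title, and the single _doc mapping type of ES 6.x (all of these names are placeholders):

PUT /my_index/_mapping/_doc
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

Since anti-spam, antispam, anti, and spam are all indexed for the same input, a match query on this field for either the hyphenated or the concatenated form will find the same documents.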