Elasticsearch analyzer tokens for alphanumeric value with dot
I have a text field with this value:
term1-term2-term3-term4-term5-RWHPSA951000155.2013-05-27.log
When I check it with the Analyze API (default analyzer), I get this:
{
"tokens": [
{
"token": "text",
"start_offset": 2,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "term1",
"start_offset": 9,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "term2",
"start_offset": 15,
"end_offset": 20,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "term3",
"start_offset": 21,
"end_offset": 26,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "term4",
"start_offset": 27,
"end_offset": 32,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "term5",
"start_offset": 33,
"end_offset": 38,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "rwhpsa951000155.2013",
"start_offset": 39,
"end_offset": 59,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "05",
"start_offset": 60,
"end_offset": 62,
"type": "<NUM>",
"position": 8
},
{
"token": "27",
"start_offset": 63,
"end_offset": 65,
"type": "<NUM>",
"position": 9
},
{
"token": "log",
"start_offset": 66,
"end_offset": 69,
"type": "<ALPHANUM>",
"position": 10
}
]
}
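I am particularly curious about this token: rwhpsa951000155.2013. How did this happen? Because of it, my searches that should match RWHPSA951000155 currently fail. How can I get it to recognize RWHPSA951000155 and 2013 as separate tokens?
Note that if the value is term1-term2-term3-term4-term5-RWHPSA.2013-05-27.log, then RWHPSA and 2013 are split into separate tokens. So this has something to do with 951000155.
Thanks,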
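The Standard Analyzer is tokenizing rwhpsa951000155.2013 as a product number. Note that the dot here sits directly between two digits (...155.2013), whereas in RWHPSA.2013 it follows a letter, which matches the difference you observed.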
Splits words at hyphens, unless there's a number in the token, in
which case the whole token is interpreted as a product number and is
not split.
You can add a pattern_replace char filter that replaces "." with whitespace. The standard tokenizer will then tokenize the value the way you want.
POST /test
{
"settings": {
"index": {
"analysis": {
"char_filter": {
"my_pattern": {
"type": "pattern_replace",
"pattern": "\\.",
"replacement": " "
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_pattern"
]
}
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"test": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
}
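One detail in the request above that is easy to get wrong is the escaping of the pattern: the regex for a literal dot is `\.`, but inside a JSON string the backslash itself must be escaped, so it is written `\\.`. As a minimal sketch (Python is used here only for illustration, and the variable names are hypothetical), letting a JSON library serialize the char filter definition shows the form that goes on the wire:

```python
import json

# The char filter from the request above, with the regex written as a raw string.
char_filter = {
    "my_pattern": {
        "type": "pattern_replace",
        "pattern": r"\.",      # regex: one literal dot
        "replacement": " ",
    }
}

# json.dumps escapes the backslash, producing "\\." in the JSON text.
body = json.dumps({"settings": {"index": {"analysis": {"char_filter": char_filter}}}})
print(body)

# Parsing the JSON text again yields the original two-character regex.
parsed = json.loads(body)
pattern = parsed["settings"]["index"]["analysis"]["char_filter"]["my_pattern"]["pattern"]
print(pattern)
```

Building the body this way avoids hand-escaping mistakes; a single unescaped backslash before the dot is not a valid JSON escape sequence.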
Call the Analyze API:
curl -XGET 'localhost:9200/test/_analyze?analyzer=my_analyzer&pretty=true' -d 'term1-term2-term3-term4-term5-RWHPSA951000155.2013-05-27.log'
Returns:
{
"tokens" : [ {
"token" : "term1",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "term2",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "term3",
"start_offset" : 12,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "term4",
"start_offset" : 18,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 4
}, {
"token" : "term5",
"start_offset" : 24,
"end_offset" : 29,
"type" : "<ALPHANUM>",
"position" : 5
}, {
"token" : "RWHPSA951000155",
"start_offset" : 30,
"end_offset" : 45,
"type" : "<ALPHANUM>",
"position" : 6
}, {
"token" : "2013",
"start_offset" : 46,
"end_offset" : 50,
"type" : "<NUM>",
"position" : 7
}, {
"token" : "05",
"start_offset" : 51,
"end_offset" : 53,
"type" : "<NUM>",
"position" : 8
}, {
"token" : "27",
"start_offset" : 54,
"end_offset" : 56,
"type" : "<NUM>",
"position" : 9
}, {
"token" : "log",
"start_offset" : 57,
"end_offset" : 60,
"type" : "<ALPHANUM>",
"position" : 10
} ]
}
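To see why the char filter helps, its effect can be approximated outside Elasticsearch. The sketch below is only a rough stand-in under stated assumptions: `dot_to_space` mimics the pattern_replace step, and `rough_tokens` is a hypothetical simplification that splits on anything that is not an ASCII letter or digit. It is not the real standard tokenizer and it does not compute offsets or token types.

```python
import re

def dot_to_space(text):
    """Mimic the pattern_replace char filter: each literal '.' becomes a space."""
    return re.sub(r"\.", " ", text)

def rough_tokens(text):
    """Rough approximation of tokenization: runs of ASCII letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

value = "term1-term2-term3-term4-term5-RWHPSA951000155.2013-05-27.log"

filtered = dot_to_space(value)
tokens = rough_tokens(filtered)

print(filtered)
print(tokens)
```

Once the dot is gone before tokenization, RWHPSA951000155 and 2013 are separated by whitespace and can no longer be joined into one token. As in the output above, the tokens keep their original case, because my_analyzer as defined has no lowercase token filter.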