在弹性搜索中使用内置或自定义分析器搜索数字和文本
Search for both numbers and text using a in-built or custom analyzer in elastic search
这个问题是我之前 SO 问题的延续。我有一些文本,我想对其进行数字和文本搜索。
我的文字:-
8080.foobar.getFooLabelFrombar(test.java:91)
我想在 getFooLabelFrombar
、fooBar
、8080
和 91
上搜索。
之前我使用的是 simple
分析器,它将上面的文本标记为下面的标记。
"tokens": [
{
"token": "foobar",
"start_offset": 10,
"end_offset": 16,
"type": "word",
"position": 2
},
{
"token": "getfoolabelfrombar",
"start_offset": 17,
"end_offset": 35,
"type": "word",
"position": 3
},
{
"token": "test",
"start_offset": 36,
"end_offset": 40,
"type": "word",
"position": 4
},
{
"token": "java",
"start_offset": 41,
"end_offset": 45,
"type": "word",
"position": 5
}
]
}
因为 foobar
和 getFooLabelFrombar
上的搜索给出了搜索结果,而不是 8080
和 91
,因为 simple analyzer 没有't tokenize 数字.
然后按照前面的建议。所以 post,我将分析器更改为 Standard
,因为可以搜索其中的数字,但不能搜索其他 2 字搜索字符串。由于标准分析器会创建以下标记:-
{
"tokens": [
{
"token": "8080",
"start_offset": 0,
"end_offset": 4,
"type": "<NUM>",
"position": 1
},
{
"token": "foobar.getfoolabelfrombar",
"start_offset": 5,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "test.java",
"start_offset": 36,
"end_offset": 45,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "91",
"start_offset": 46,
"end_offset": 48,
"type": "<NUM>",
"position": 4
}
]
}
我去了ES中所有现有的分析器,但似乎没有一个能满足我的要求。我尝试创建以下自定义分析器,但效果不佳。
{
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "letter"
"filter" : ["lowercase", "extract_numbers"]
}
},
"filter" : {
"extract_numbers" : {
"type" : "keep_types",
"types" : [ "<NUM>","<ALPHANUM>","word"]
}
}
}
}
请提出建议,如何构建我的自定义分析器以满足我的要求。
使用字符过滤器将点替换为空格怎么样?
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": ["replace_dots"]
}
},
"char_filter": {
"replace_dots": {
"type": "mapping",
"mappings": [
". => \u0020"
]
}
}
}
}
}
POST /my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "8080.foobar.getFooLabelFrombar(test.java:91)"
}
哪个输出你想要的:
{
"tokens" : [
{
"token" : "8080",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "foobar",
"start_offset" : 10,
"end_offset" : 16,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "getFooLabelFrombar",
"start_offset" : 17,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "test",
"start_offset" : 36,
"end_offset" : 40,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "java",
"start_offset" : 41,
"end_offset" : 45,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "91",
"start_offset" : 46,
"end_offset" : 48,
"type" : "<NUM>",
"position" : 6
}
]
}
这个问题是我之前
我的文字:-
8080.foobar.getFooLabelFrombar(test.java:91)
我想在 getFooLabelFrombar
、fooBar
、8080
和 91
上搜索。
之前我使用的是 simple
分析器,它将上面的文本标记为下面的标记。
"tokens": [
{
"token": "foobar",
"start_offset": 10,
"end_offset": 16,
"type": "word",
"position": 2
},
{
"token": "getfoolabelfrombar",
"start_offset": 17,
"end_offset": 35,
"type": "word",
"position": 3
},
{
"token": "test",
"start_offset": 36,
"end_offset": 40,
"type": "word",
"position": 4
},
{
"token": "java",
"start_offset": 41,
"end_offset": 45,
"type": "word",
"position": 5
}
]
}
因为 foobar
和 getFooLabelFrombar
上的搜索给出了搜索结果,而不是 8080
和 91
,因为 simple analyzer 没有't tokenize 数字.
然后按照前面的建议。所以 post,我将分析器更改为 Standard
,因为可以搜索其中的数字,但不能搜索其他 2 字搜索字符串。由于标准分析器会创建以下标记:-
{
"tokens": [
{
"token": "8080",
"start_offset": 0,
"end_offset": 4,
"type": "<NUM>",
"position": 1
},
{
"token": "foobar.getfoolabelfrombar",
"start_offset": 5,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "test.java",
"start_offset": 36,
"end_offset": 45,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "91",
"start_offset": 46,
"end_offset": 48,
"type": "<NUM>",
"position": 4
}
]
}
我去了ES中所有现有的分析器,但似乎没有一个能满足我的要求。我尝试创建以下自定义分析器,但效果不佳。
{
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "letter"
"filter" : ["lowercase", "extract_numbers"]
}
},
"filter" : {
"extract_numbers" : {
"type" : "keep_types",
"types" : [ "<NUM>","<ALPHANUM>","word"]
}
}
}
}
请提出建议,如何构建我的自定义分析器以满足我的要求。
使用字符过滤器将点替换为空格怎么样?
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": ["replace_dots"]
}
},
"char_filter": {
"replace_dots": {
"type": "mapping",
"mappings": [
". => \u0020"
]
}
}
}
}
}
POST /my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "8080.foobar.getFooLabelFrombar(test.java:91)"
}
哪个输出你想要的:
{
"tokens" : [
{
"token" : "8080",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "foobar",
"start_offset" : 10,
"end_offset" : 16,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "getFooLabelFrombar",
"start_offset" : 17,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "test",
"start_offset" : 36,
"end_offset" : 40,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "java",
"start_offset" : 41,
"end_offset" : 45,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "91",
"start_offset" : 46,
"end_offset" : 48,
"type" : "<NUM>",
"position" : 6
}
]
}