Unable to understand elasticsearch analyser regex
Can someone help me understand why my understanding of the elasticsearch analyser is wrong?
I have an index with various fields, one of which is:
"categories": {
"type": "text",
"analyzer": "words_only_analyser",
"copy_to": "all",
"fields": {
"tokens": {
"type": "text",
"analyzer": "words_only_analyser",
"term_vector": "yes",
"fielddata" : True
}
}
}
The words_only_analyser looks like:
"words_only_analyser":{
"type":"custom",
"tokenizer":"words_only_tokenizer",
"char_filter" : ["html_strip"],
"filter":[ "lowercase", "asciifolding", "stop_filter", "kstem" ]
},
and the words_only_tokenizer looks like:
"tokenizer":{
"words_only_tokenizer":{
"type":"pattern",
"pattern":"[^\w-]+"
}
}
My understanding of the pattern [^\w-]+ in the tokenizer is that it will tokenise a sentence such that it splits at \ or w or -. For example, given that pattern, for the sentence:
seasonal-christmas-halloween this is a description about halloween
I expect to see:
[seasonal, christmas, hallo, een this is a description about hallo, een]
To confirm the above, when I run the words_only_analyser on the sentence above:
curl -XGET localhost:9200/contextual/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer":"words_only_analyser","text":"seasonal-christmas-halloween this is a description about halloween"}'
I get:
{
"tokens" : [
{
"token" : "seasonal-christmas-halloween",
"start_offset" : 0,
"end_offset" : 28,
"type" : "word",
"position" : 0
},
{
"token" : "description",
"start_offset" : 39,
"end_offset" : 50,
"type" : "word",
"position" : 4
},
{
"token" : "halloween",
"start_offset" : 57,
"end_offset" : 66,
"type" : "word",
"position" : 6
}
]
}
This tells me the sentence is tokenised into:
[seasonal-christmas-halloween, description, halloween]
It looks to me as if the tokenizer pattern is not being applied at all. Can someone explain where my understanding goes wrong?
There are a couple of things that shape the final tokens an analyzer produces: first the tokenizer runs, and then the token filters (for example, you have a stop_filter that removes stop words such as this, is, and a).
You can also use the Analyze API to test just your tokenizer. I created your configuration and it produces the tokens below.
POST _analyze
{
"tokenizer": "words_only_tokenizer", // Note `tokenizer` here
"text": "seasonal-christmas-halloween this is a description about halloween"
}
Result:
{
"tokens": [
{
"token": "seasonal-christmas-halloween",
"start_offset": 0,
"end_offset": 28,
"type": "word",
"position": 0
},
{
"token": "this",
"start_offset": 29,
"end_offset": 33,
"type": "word",
"position": 1
},
{
"token": "is",
"start_offset": 34,
"end_offset": 36,
"type": "word",
"position": 2
},
{
"token": "a",
"start_offset": 37,
"end_offset": 38,
"type": "word",
"position": 3
},
{
"token": "description",
"start_offset": 39,
"end_offset": 50,
"type": "word",
"position": 4
},
{
"token": "about",
"start_offset": 51,
"end_offset": 56,
"type": "word",
"position": 5
},
{
"token": "halloween",
"start_offset": 57,
"end_offset": 66,
"type": "word",
"position": 6
}
]
}
You can see that the stop words are still there, because only the tokenizer has run at this point: it breaks the text on whitespace (any run of characters that is neither a word character nor -), so it does not split on - and keeps the hyphenated term in one piece.
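In other words, the pattern tokenizer treats the regex as the separator to split on, not as the set of characters to split at. Elasticsearch's pattern tokenizer uses Java regular expressions, but as a rough illustration under that assumption, a small Python sketch with the same pattern and sentence from the question behaves the same way:

import re

sentence = "seasonal-christmas-halloween this is a description about halloween"

# The pattern tokenizer splits on every match of the pattern, so [^\w-]+
# matches the separators (runs of characters that are neither \w nor '-'),
# i.e. the spaces here; hyphenated words therefore stay in one piece.
print(re.split(r"[^\w-]+", sentence))
# ['seasonal-christmas-halloween', 'this', 'is', 'a', 'description', 'about', 'halloween']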
Now, if you run the same text through the analyzer, which also applies the filters, the stop words are removed and you get the following tokens.
POST _analyze
{
"analyzer": "words_only_analyser",
"text": "seasonal-christmas-halloween this is a description about halloween"
}
Result:
{
"tokens": [
{
"token": "seasonal-christmas-halloween",
"start_offset": 0,
"end_offset": 28,
"type": "word",
"position": 0
},
{
"token": "description",
"start_offset": 39,
"end_offset": 50,
"type": "word",
"position": 4
},
{
"token": "about",
"start_offset": 51,
"end_offset": 56,
"type": "word",
"position": 5
},
{
"token": "halloween",
"start_offset": 57,
"end_offset": 66,
"type": "word",
"position": 6
}
]
}
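To tie the two steps together, here is a rough Python approximation of the whole pipeline, purely for illustration: the STOP_WORDS set is a hypothetical stand-in for whatever your stop_filter removes, and asciifolding and kstem are not reproduced. It shows how the tokenizer output is then reduced by the filters:

import re

STOP_WORDS = {"this", "is", "a"}  # stand-in for the words your stop_filter removes

def approximate_words_only_analyser(text):
    # 1. pattern tokenizer: split on runs of characters that are neither \w nor '-'
    tokens = re.split(r"[^\w-]+", text)
    # 2. lowercase filter
    tokens = [t.lower() for t in tokens]
    # 3. stop filter (asciifolding and kstem are not reproduced here)
    return [t for t in tokens if t and t not in STOP_WORDS]

print(approximate_words_only_analyser(
    "seasonal-christmas-halloween this is a description about halloween"))
# ['seasonal-christmas-halloween', 'description', 'about', 'halloween']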