Unable to understand elasticsearch analyser regex

Can anyone help me understand why my understanding of the elasticsearch analyser is wrong?

I have an index with various fields, one of which is:

"categories": {
    "type": "text",
    "analyzer": "words_only_analyser",
    "copy_to": "all",
    "fields": {
         "tokens": {
             "type": "text",
             "analyzer": "words_only_analyser",
             "term_vector": "yes",
             "fielddata" : True
          }
      }
}

The words_only_analyser looks like:

"words_only_analyser":{
    "type":"custom",
    "tokenizer":"words_only_tokenizer",
    "char_filter" : ["html_strip"],
    "filter":[ "lowercase", "asciifolding", "stop_filter", "kstem" ]
},

and the words_only_tokenizer looks like:

"tokenizer":{
    "words_only_tokenizer":{
    "type":"pattern",
    "pattern":"[^\w-]+"
    }
}

My understanding of the pattern [^\w-]+ in the tokenizer is that it will tokenise a sentence by splitting it at \w and -. For example, given that pattern, for the sentence:

seasonal-christmas-halloween this is a description about halloween

I expect to see:

[seasonal, christmas, hallo, een this is a description about hallo, een]

I can confirm the above from https://regex101.com/

However, when I run the sentence above through the words_only_analyser:

curl -XGET localhost:9200/contextual/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer":"words_only_analyser","text":"seasonal-christmas-halloween this is a description about halloween"}'

I get:

{
  "tokens" : [
    {
      "token" : "seasonal-christmas-halloween",
      "start_offset" : 0,
      "end_offset" : 28,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "description",
      "start_offset" : 39,
      "end_offset" : 50,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "halloween",
      "start_offset" : 57,
      "end_offset" : 66,
      "type" : "word",
      "position" : 6
    }
  ]
}

This tells me the sentence is tokenised into:

[seasonal-christmas-halloween, description, halloween]

It looks to me as if the tokenizer pattern is not being applied? Can anyone explain where my understanding goes wrong?

There are a few things that shape the final tokens an analyzer produces: first the tokenizer runs, then the token filters (for example, your stop_filter removes stop words such as this, is and a).
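
For reference, a minimal index-settings sketch that puts these pieces together might look like the following. The stop_filter definition is an assumption here (it is not shown in the question) and is modelled as a standard stop filter; the index name contextual is taken from your curl command, and the backslash in the pattern is escaped for JSON:

PUT /contextual
{
    "settings": {
        "analysis": {
            "tokenizer": {
                "words_only_tokenizer": {
                    "type": "pattern",
                    "pattern": "[^\\w-]+"
                }
            },
            "filter": {
                "stop_filter": {
                    "type": "stop",
                    "stopwords": "_english_"
                }
            },
            "analyzer": {
                "words_only_analyser": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "words_only_tokenizer",
                    "filter": ["lowercase", "asciifolding", "stop_filter", "kstem"]
                }
            }
        }
    }
}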

You can also use the _analyze API to test just your tokenizer. I created your configuration and it produces the tokens below.

POST _analyze

{
    "tokenizer": "words_only_tokenizer", // Note `tokenizer` here
    "text": "seasonal-christmas-halloween this is a description about halloween"
}
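
If you prefer curl, as in the question, the same request could be sent against the index-level endpoint (needed because words_only_tokenizer is defined in the index settings; the contextual index name is assumed from the question), and it returns the same tokens shown below:

curl -XPOST "localhost:9200/contextual/_analyze?pretty" -H 'Content-Type: application/json' -d '{"tokenizer":"words_only_tokenizer","text":"seasonal-christmas-halloween this is a description about halloween"}'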

Result

{
    "tokens": [
        {
            "token": "seasonal-christmas-halloween",
            "start_offset": 0,
            "end_offset": 28,
            "type": "word",
            "position": 0
        },
        {
            "token": "this",
            "start_offset": 29,
            "end_offset": 33,
            "type": "word",
            "position": 1
        },
        {
            "token": "is",
            "start_offset": 34,
            "end_offset": 36,
            "type": "word",
            "position": 2
        },
        {
            "token": "a",
            "start_offset": 37,
            "end_offset": 38,
            "type": "word",
            "position": 3
        },
        {
            "token": "description",
            "start_offset": 39,
            "end_offset": 50,
            "type": "word",
            "position": 4
        },
        {
            "token": "about",
            "start_offset": 51,
            "end_offset": 56,
            "type": "word",
            "position": 5
        },
        {
            "token": "halloween",
            "start_offset": 57,
            "end_offset": 66,
            "type": "word",
            "position": 6
        }
    ]
}

You can notice that the stop words are still there: the pattern tokenizer uses its pattern to match the separators, and [^\w-]+ only matches runs of characters that are neither word characters nor -, which in this sentence means the whitespace. So the text is only broken on whitespace and is never split on - itself.

Now, if you run the same text through the analyzer, which also applies the filters, it removes the stop words and gives you the following tokens.

POST _analyze

{
    "analyzer": "words_only_analyser",
    "text": "seasonal-christmas-halloween this is a description about halloween"
}

Result

{
    "tokens": [
        {
            "token": "seasonal-christmas-halloween",
            "start_offset": 0,
            "end_offset": 28,
            "type": "word",
            "position": 0
        },
        {
            "token": "description",
            "start_offset": 39,
            "end_offset": 50,
            "type": "word",
            "position": 4
        },
        {
            "token": "about",
            "start_offset": 51,
            "end_offset": 56,
            "type": "word",
            "position": 5
        },
        {
            "token": "halloween",
            "start_offset": 57,
            "end_offset": 66,
            "type": "word",
            "position": 6
        }
    ]
}