preserve_original: keep the original token in Elasticsearch

I have a token filter and analyzer defined as shown below, but I cannot get the original token preserved. For example, if I run _analyze on the word saint-louis, I only get back saintlouis, whereas I expect both saintlouis and saint-louis because I have preserve_original set to true. The ES version I am using is 6.3.2 and the Lucene version is 7.3.1.

"analysis": {
  "filter": {
    "hyphenFilter": {
      "pattern": "-",
      "type": "pattern_replace",
      "preserve_original": "true",
      "replacement": ""
    }
  },
  "analyzer": {
    "whitespace_lowercase": {
      "filter": [
        "lowercase",
        "asciifolding",
        "hyphenFilter"
      ],
      "type": "custom",
      "tokenizer": "whitespace"
    }
  }
}
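
For reference, the behaviour I describe can be reproduced with an _analyze request like the sketch below (the index name my_index is an assumption; the request has to target an index carrying these settings, since whitespace_lowercase is a custom analyzer). With the settings above it only returns saintlouis.

POST /my_index/_analyze
{
  "analyzer": "whitespace_lowercase",
  "text": "saint-louis"
}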

So it looks like preserve_original is not supported on the pattern_replace token filter, at least not in the version I am using.

I worked around it as follows:

Index definition

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "whitespace",
                    "type": "custom",
                    "filter": [
                        "lowercase",
                        "hyphen_filter"
                    ]
                }
            },
            "filter": {
                "hyphen_filter": {
                    "type": "word_delimiter",
                    "preserve_original": "true",
                    "catenate_words": "true"
                }
            }
        }
    }
}
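
To actually index with this analyzer, a field mapping needs to reference it. A minimal sketch, assuming an index named my_index, a _doc type, and a field named title (all three names are assumptions; ES 6.x syntax):

PUT /my_index/_mapping/_doc
{
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "my_analyzer"
        }
    }
}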

For example, this will tokenize a word like anti-spam into anti-spam (the original, preserved), anti, antispam (the hyphen removed and the parts concatenated), and spam.

Analyze API call to see the generated tokens

POST /_analyze

{ "text": "anti-spam", "analyzer" : "my_analyzer" }

Output of the _analyze API, i.e. the generated tokens:

{
    "tokens": [
        {
            "token": "anti-spam",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "anti",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "antispam",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "spam",
            "start_offset": 5,
            "end_offset": 9,
            "type": "word",
            "position": 1
        }
    ]
}
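
With a field mapped to my_analyzer as sketched above, both the hyphenated and the concatenated form of a term should match at query time, because the original token is kept alongside the concatenated one. A hedged example using the same assumed index and field names:

GET /my_index/_search
{
    "query": {
        "match": {
            "title": "anti-spam"
        }
    }
}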