How to tokenize a Roman numeral term in ElasticSearch?

Question

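When I create a tokenizer by registering token characters as shown below, the Roman numeral 'X' cannot be registered. (Tested ES versions: ES6.7, ES5.6)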

      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 14,
          "token_chars": [
            "Ⅹ"
          ]
        }
      }

The error log looks like this:

{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[node02][192.168.115.x:9300][indices:admin/create]"}],"type":"illegal_argument_exception","reason":"Unknown token type: 'ⅹ', must be one of [symbol, private_use, paragraph_separator, start_punctuation, unassigned, enclosing_mark, connector_punctuation, letter_number, other_number, math_symbol, lowercase_letter, space_separator, surrogate, initial_quote_punctuation, decimal_digit_number, digit, other_punctuation, dash_punctuation, currency_symbol, non_spacing_mark, format, modifier_letter, control, uppercase_letter, other_symbol, end_punctuation, modifier_symbol, other_letter, line_separator, titlecase_letter, letter, punctuation, combining_spacing_mark, final_quote_punctuation, whitespace]"},"status":400}

How can I tokenize Roman numerals as terms?

Answer 1

The error message clearly states that your Roman X is not a valid token type. It also lists the valid options for the token type, as follows:

must be one of [symbol, private_use, paragraph_separator, start_punctuation, unassigned, enclosing_mark, connector_punctuation, letter_number, other_number, math_symbol, lowercase_letter, space_separator, surrogate, initial_quote_punctuation, decimal_digit_number, digit, other_punctuation, dash_punctuation, currency_symbol, non_spacing_mark, format, modifier_letter, control, uppercase_letter, other_symbol, end_punctuation, modifier_symbol, other_letter, line_separator, titlecase_letter, letter, punctuation, combining_spacing_mark, final_quote_punctuation, whitespace]

The problem is in your syntax. If you look up token_chars in the official ES documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html, you can see what it means, as described below:

Character classes that should be included in a token. Elasticsearch will split on characters that don’t belong to the classes specified. Defaults to [] (keep all characters).

Right below that, the documentation lists the valid values, such as digit and letter. The same link also has some examples that use token_chars with valid values.

If you replace X with letter in your analyzer settings, your problem will be solved.
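As a rough illustration, the corrected tokenizer definition might look like the sketch below. The enclosing "settings" and "analysis" wrapper is an assumption about how the fragment from the question sits inside the index creation request. The "letter_number" entry is also an assumption on my part: if the character in question is the Unicode Roman numeral Ⅹ (U+2169) rather than the Latin letter X, its Unicode general category is Nl (letter number), which matches the letter_number class named in the error message and may not be covered by letter alone.

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 14,
          "token_chars": [
            "letter",
            "letter_number"
          ]
        }
      }
    }
  }
}
```

The point is that token_chars takes names of character classes, not literal characters, so every entry has to be one of the values listed in the error message.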

lucene

tokenize

elasticsearch

elasticsearch-analyzers