How to tokenize a Roman numeral term in ElasticSearch?
When creating a tokenizer by registering token characters as follows, the Roman numeral 'Ⅹ' cannot be registered. (Tested ES versions: ES 6.7, ES 5.6)
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 14,
"token_chars": [
"Ⅹ"
]
}
}
The error log looks like this:
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[node02][192.168.115.x:9300][indices:admin/create]"}],"type":"illegal_argument_exception","reason":"Unknown
token type: 'ⅹ', must be one of [symbol, private_use,
paragraph_separator, start_punctuation, unassigned, enclosing_mark,
connector_punctuation, letter_number, other_number, math_symbol,
lowercase_letter, space_separator, surrogate,
initial_quote_punctuation, decimal_digit_number, digit,
other_punctuation, dash_punctuation, currency_symbol,
non_spacing_mark, format, modifier_letter, control, uppercase_letter,
other_symbol, end_punctuation, modifier_symbol, other_letter,
line_separator, titlecase_letter, letter, punctuation,
combining_spacing_mark, final_quote_punctuation,
whitespace]"},"status":400}
How can Roman numerals be tokenized as terms?
The error message clearly states that your Roman Ⅹ is not a valid token type. It also lists the valid options for token type, as shown below:
must be one of [symbol, private_use, paragraph_separator,
start_punctuation, unassigned, enclosing_mark, connector_punctuation,
letter_number, other_number, math_symbol, lowercase_letter,
space_separator, surrogate, initial_quote_punctuation,
decimal_digit_number, digit, other_punctuation, dash_punctuation,
currency_symbol, non_spacing_mark, format, modifier_letter, control,
uppercase_letter, other_symbol, end_punctuation, modifier_symbol,
other_letter, line_separator, titlecase_letter, letter, punctuation,
combining_spacing_mark, final_quote_punctuation, whitespace]
The problem is with your syntax. If you look at token_chars in the official ES documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html, you can see what it means, as described there:
Character classes that should be included in a token. Elasticsearch
will split on characters that don’t belong to the classes specified.
Defaults to [] (keep all characters).
Further down, the same page again specifies the valid values, such as digit and letter, and it has some examples that use token_chars with valid values, along the lines of the sketch below.
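For example, a tokenizer definition in the style of the documentation's examples uses character classes rather than literal characters (the class values below are illustrative):

    "tokenizer": {
      "autocomplete": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 14,
        "token_chars": [
          "letter",
          "digit"
        ]
      }
    }

With this definition, any character that is neither a letter nor a digit acts as a token boundary.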
If you replace Ⅹ with letter in your analyzer settings, your problem will be solved; a complete sketch with the analyzer wiring follows.
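Putting it together, here is a minimal end-to-end sketch. The index name my_index and the wrapping analyzer definition are assumptions for illustration, not part of the original post. Note also that the dedicated Roman numeral code points such as Ⅹ (U+2169) belong to the Unicode letter_number category, which appears in the list of valid options above, so including letter_number alongside letter should keep those characters even if letter alone does not match them:

    PUT my_index
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "autocomplete": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": 14,
              "token_chars": [
                "letter",
                "letter_number"
              ]
            }
          },
          "analyzer": {
            "autocomplete": {
              "type": "custom",
              "tokenizer": "autocomplete"
            }
          }
        }
      }
    }

You can then verify the behavior with the _analyze API:

    POST my_index/_analyze
    {
      "analyzer": "autocomplete",
      "text": "ⅩⅡ"
    }

If the tokenizer keeps the Roman numeral characters, the response lists the edge n-grams of the input (here 'Ⅹ' and 'ⅩⅡ'); if they are split away as non-token characters, the token list comes back empty.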