Elasticsearch - mapping with type text and keyword tokenizer, how is it indexed?
I am new to Elasticsearch and a bit confused about how a field is stored in the Lucene index, because I am getting the error: Document contains at least one immense term in field="originalrow.sortable" ..... bytes can be at most 32766 in length; got 893970
The mapping in the index template:
"analyzer" : {
"rebuilt_hungarian" : {
"filter" : [
"lowercase",
"hungarian_stop",
"hungarian_keywords",
"hungarian_stemmer",
"asciifolding"
],
"tokenizer" : "standard"
},
"lowercase_for_sort" : {
"filter" : [
"lowercase"
],
"tokenizer" : "keyword"
}
}
..
..
"dynamic_templates" : [
{
"sortable_text" : {
"mapping" : {
"analyzer" : "rebuilt_hungarian",
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
},
"sortable" : {
"fielddata" : true,
"analyzer" : "lowercase_for_sort",
"type" : "text"
}
}
},
"match_mapping_type" : "string"
}
}
],
And the generated mapping for the field involved in the error:
"originalrow" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
},
"sortable" : {
"type" : "text",
"analyzer" : "lowercase_for_sort",
"fielddata" : true
}
},
"analyzer" : "rebuilt_hungarian"
}
So I think - of course I could be wrong - that the originalrow.sortable field is indexed as text, but because of the keyword tokenizer the whole text goes into the inverted index as a single term, and that may be the cause of the error. The other thing is that the text is only about 1800 characters long, and I don't see how its size can exceed 32K bytes.
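A quick way to confirm the single-token behaviour is the _analyze API (the index name my_index below is a placeholder for the actual index):

GET /my_index/_analyze
{
  "analyzer": "lowercase_for_sort",
  "text": "The Quick Brown Fox"
}

The response contains exactly one token, "the quick brown fox": the keyword tokenizer emits the entire input as a single token, and the lowercase filter only changes its case.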
Thanks in advance!
For your field sortable, you are using lowercase_for_sort, which again uses the keyword tokenizer that generates a single token, and in Lucene the maximum size of a token is 32766 bytes, as mentioned in this post.
This limit can be exceeded if you are using characters that take more than 1 byte. From the UTF docs:
A UTF maps each Unicode code point to a unique code unit sequence. A code unit is the minimal bit combination that can represent a character. Each UTF uses a different code unit size. For example, UTF-8 is based on 8-bit code units. Therefore, each character can be 8 bits (1 byte), 16 bits (2 bytes), 24 bits (3 bytes), or 32 bits (4 bytes). Likewise, UTF-16 is based on 16-bit code units. Therefore, each character can be 16 bits (2 bytes) or 32 bits (4 bytes).
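Note that even at 4 bytes per character, ~1800 characters comes to at most ~7200 bytes, so a term of 893970 bytes means the value in the failing document is far longer than 1800 characters. If indexing should succeed regardless of the value's length, one option is to cap the single token with Elasticsearch's built-in truncate token filter. A minimal sketch, assuming the analyzer is redefined at index creation; the filter name sort_truncate and the length 8000 (8000 characters x 4 bytes = 32000 bytes, under the 32766-byte cap) are placeholder choices, not part of your original template:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "sort_truncate": {
          "type": "truncate",
          "length": 8000
        }
      },
      "analyzer": {
        "lowercase_for_sort": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "sort_truncate"
          ]
        }
      }
    }
  }
}

Since sorting compares terms from the first character onward, truncating the sort token this way rarely changes the resulting order in practice.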