Elasticsearch - mapping with type text and keyword tokenizer, how is it indexed?

I am new to Elasticsearch and a little confused about how a certain field is stored in the Lucene index, because I am getting this error: Document contains at least one immense term in field="originalrow.sortable" ..... bytes can be at most 32766 in length; got 893970

The mapping in the index template:

 "analyzer" : {
    "rebuilt_hungarian" : {
      "filter" : [
        "lowercase",
        "hungarian_stop",
        "hungarian_keywords",
        "hungarian_stemmer",
        "asciifolding"
      ],
      "tokenizer" : "standard"
    },
    "lowercase_for_sort" : {
      "filter" : [
        "lowercase"
      ],
      "tokenizer" : "keyword"
    }
  }
  ..
  ..
    "dynamic_templates" : [
    {
      "sortable_text" : {
        "mapping" : {
          "analyzer" : "rebuilt_hungarian",
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword"
            },
            "sortable" : {
              "fielddata" : true,
              "analyzer" : "lowercase_for_sort",
              "type" : "text"
            }
          }
        },
        "match_mapping_type" : "string"
      }
    }
  ],

And the mapping that was generated for the field involved in the error:

"originalrow" : {
  "type" : "text",
  "fields" : {
    "keyword" : {
      "type" : "keyword"
    },
    "sortable" : {
      "type" : "text",
      "analyzer" : "lowercase_for_sort",
      "fielddata" : true
    }
  },
  "analyzer" : "rebuilt_hungarian"
}

So I think - of course I may be wrong - that the originalrow.sortable field is indexed as text, but because of the keyword tokenizer the entire text goes into the inverted index as a single term, and that this is probably the cause of the error. The other thing is that the text is only about 1,800 characters long, and I don't see how its size could exceed 32K bytes.
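One way to confirm what the keyword tokenizer produces is to run the field's analyzer through the _analyze API (a sketch, assuming the index that uses these settings is called my_index; the sample text is made up):

  POST my_index/_analyze
  {
    "analyzer" : "lowercase_for_sort",
    "text" : "Valamilyen Hosszú Magyar Szöveg"
  }

Because the keyword tokenizer emits its whole input as one token, the response contains a single lowercased token, "valamilyen hosszú magyar szöveg", instead of one token per word.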

Thanks in advance!!!

For your sortable field you are using the lowercase_for_sort analyzer, which in turn uses the keyword tokenizer and therefore produces a single token, and in Lucene the maximum size of a token is 32766 bytes, as mentioned in this post.
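If the sub-field is only used for sorting, a common way to keep that single token under the limit is to truncate it inside the analyzer (a sketch, not something your mapping already does; the filter name sort_truncate and the length of 1000 are assumptions):

  "analysis" : {
    "filter" : {
      "sort_truncate" : {
        "type" : "truncate",
        "length" : 1000
      }
    },
    "analyzer" : {
      "lowercase_for_sort" : {
        "tokenizer" : "keyword",
        "filter" : [ "lowercase", "sort_truncate" ]
      }
    }
  }

Since sorting only ever compares the leading characters of the token, truncating it rarely changes the resulting order.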

You can exceed this limit if you use characters that take more than one byte each. From the UTF docs:

A UTF maps each Unicode code point to a unique code unit sequence. A code unit is the minimal bit combination that can represent a character. Each UTF uses a different code unit size. For example, UTF-8 is based on 8-bit code units. Therefore, each character can be 8 bits (1 byte), 16 bits (2 bytes), 24 bits (3 bytes), or 32 bits (4 bytes). Likewise, UTF-16 is based on 16-bit code units. Therefore, each character can be 16 bits (2 bytes) or 32 bits (4 bytes).
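As a quick worked example (the numbers are mine, not from the quote): Hungarian accented letters such as ő (U+0151) take 2 bytes each in UTF-8, so

  16384 characters × 2 bytes/character = 32768 bytes > 32766 bytes
   8192 characters × 4 bytes/character = 32768 bytes > 32766 bytes

which means a single token can breach the limit with far fewer than 32766 characters.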