Custom analyzer with large char_filter list creation for elasticsearch

I am trying to add a custom analyzer to Elasticsearch. My list of synonym rules for the char filter's "mappings" setting (mapper_list) is very large: about 30,000 elements.

import json
import requests

# es_host and mapper_list are defined elsewhere:
# es_host points at the index URL, and mapper_list holds
# the ~30,000 mapping rules.

# close the index so the analysis settings can be updated
requests.post(es_host + '/_close')

settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "my_mapping": {
                    "type": "mapping",
                    "mappings": mapper_list
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "char_filter": ["my_mapping"]
                }
            }
        }
    }
}

requests.put(es_host + '/_settings',
             data=json.dumps(settings))

# reopen the index after the update
requests.post(es_host + '/_open')
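The question does not show how mapper_list is built. Assuming it holds rules in the "key => value" syntax that the mapping char filter expects, a minimal builder might look like this (build_mapping_rules and the sample pairs are hypothetical, for illustration only):

```python
def build_mapping_rules(pairs):
    """Format (source, target) pairs in the "a => b" rule syntax
    used by the mapping char_filter."""
    return ["{} => {}".format(src, tgt) for src, tgt in pairs]

# two toy replacement rules
mapper_list = build_mapping_rules([("ph", "f"), ("qu", "k")])
print(mapper_list)  # ['ph => f', 'qu => k']
```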

The error message from Elasticsearch:

[test-index] IndexCreationException[failed to create index]; nested: ArrayIndexOutOfBoundsException[256];
    at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:360)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewIndices(IndicesClusterStateService.java:313)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:174)
    at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:610)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Any suggestions on how to solve this problem would be appreciated.

Information about the ES version:

  "version" : {
    "number" : "2.4.1",
    "build_hash" : "c67dc32e24162035d18d6fe1e952c4cbcbe79d16",
    "build_timestamp" : "2016-09-27T18:57:55Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.2"
  }

I think the cause of the error is that you are mapping very long strings. What exactly are you trying to map? If you look at the source code, there is a 256-character limit per mapping rule, and you are exceeding that limit. I get the same exception

ArrayIndexOutOfBoundsException[256]

when I try to map a large string:

{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": ["More than 256 characters. Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. => exception will be thrown"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_mapping"
          ]
        }
      }
    }
  }
}

I don't know your use case, but if you reduce the length of the strings you are mapping, it should work.
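One quick way to act on this is to scan the rule list for entries that exceed the 256-character buffer before sending the settings. This is a sketch assuming the rules are plain strings; the names RULE_LIMIT and oversized_rules are made up for illustration:

```python
# Rules longer than the 256-character parser buffer trigger
# ArrayIndexOutOfBoundsException[256] in this ES version.
RULE_LIMIT = 256

def oversized_rules(mappings):
    """Return the rules that exceed the per-rule buffer limit."""
    return [rule for rule in mappings if len(rule) > RULE_LIMIT]

short_rule = "foo => bar"
long_rule = ("x" * 300) + " => y"
print(oversized_rules([short_rule, long_rule]) == [long_rule])  # True
```

Running this over the 30,000-element mapper_list before the PUT request would identify the offending rules without a round trip to the cluster.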