Exclude from CamelCase tokenizer in Elasticsearch

Question

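I'm struggling to make iPhone match when searching for iphone in Elasticsearch.

Since I'm indexing some source code, I definitely need a CamelCase tokenizer, but it appears to split iPhone into two terms, so a search for iphone finds nothing.

Does anyone know a way to add exceptions to the splitting of camel-case words into tokens (camel + case)?

UPDATE: to be clear, I want NullPointerException to be tokenized as [null, pointer, exception], but I don't want iPhone to become [i, phone].

Is there any other solution?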

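UPDATE 2: @ChintanShah's answer suggests a different approach that gives us more from the searcher's point of view: NullPointerException will be tokenized as [null, pointer, exception, nullpointer, pointerexception, nullpointerexception]. Indexing is also faster! The price to pay is index size, but it is a better solution.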
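To illustrate where that larger token list comes from, here is a rough Python sketch. It is not Elasticsearch code and not necessarily how the referenced answer produces the tokens; it only assumes that the extra tokens are every contiguous run of the camel-case parts joined together. The function name `contiguous_joins` is made up for this example.

```python
def contiguous_joins(parts):
    """Join every contiguous run of parts, shortest runs first."""
    tokens = []
    for size in range(1, len(parts) + 1):
        for start in range(0, len(parts) - size + 1):
            # Concatenate `size` neighbouring parts into one token.
            tokens.append("".join(parts[start:start + size]))
    return tokens


if __name__ == "__main__":
    print(contiguous_joins(["null", "pointer", "exception"]))
```

With three parts this yields six tokens, which is why the index grows: the number of tokens per word increases quadratically with the number of parts.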

Answer 1

You can achieve what you want with the word_delimiter token filter. This is my setup:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "camel_filter",
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "camel_filter": {
          "type": "word_delimiter",
          "generate_number_parts": false,
          "stem_english_possessive": false,
          "split_on_numerics": false,
          "protected_words": [
            "iPhone",
            "WiFi"
          ]
        }
      }
    }
  },
  "mappings": {
  }
}

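This splits words on case changes, so NullPointerException is tokenized as null, pointer and exception, while iPhone and WiFi stay intact because they are protected. word_delimiter has a lot of flexible options. You can also set preserve_original, which may help you a lot.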
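As a rough mental model of what this analyzer does, here is a small Python approximation. It is only a sketch, not the Lucene implementation: it assumes whitespace tokenization, splits only where a lowercase letter is followed by an uppercase one, skips protected words, and lowercases the result. The real filter also splits on non-alphanumeric characters and has more options. `camel_tokens` and `PROTECTED_WORDS` are names invented for this example.

```python
import re

# Stand-in for the filter's "protected_words" setting (same two words as above).
PROTECTED_WORDS = {"iPhone", "WiFi"}

# Zero-width position between a lowercase letter and an uppercase letter.
_CASE_CHANGE = re.compile(r"(?<=[a-z])(?=[A-Z])")


def camel_tokens(text, protected=PROTECTED_WORDS):
    """Approximate the whitespace + case-change split + lowercase chain."""
    tokens = []
    for word in text.split():  # whitespace tokenizer
        if word in protected:
            parts = [word]  # protected words are never split
        else:
            # Insert a space at each lower-to-upper change, then split on it.
            parts = _CASE_CHANGE.sub(" ", word).split()
        tokens.extend(part.lower() for part in parts)  # lowercase filter
    return tokens


if __name__ == "__main__":
    print(camel_tokens("NullPointerException"))
    print(camel_tokens("iPhone"))
```

Passing `protected=set()` shows the original problem again: iPhone is split into i and phone.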

GET logs_index/_analyze?text=iPhone&analyzer=camel_analyzer

Result

{
   "tokens": [
      {
         "token": "iphone",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 1
      }
   ]
}

Now

GET logs_index/_analyze?text=NullPointerException&analyzer=camel_analyzer

Result

{
   "tokens": [
      {
         "token": "null",
         "start_offset": 0,
         "end_offset": 4,
         "type": "word",
         "position": 1
      },
      {
         "token": "pointer",
         "start_offset": 4,
         "end_offset": 11,
         "type": "word",
         "position": 2
      },
      {
         "token": "exception",
         "start_offset": 11,
         "end_offset": 20,
         "type": "word",
         "position": 3
      }
   ]
}

Another approach is to analyze your field twice with different analyzers, but I think word_delimiter will do the trick.
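For the "analyze twice" idea, one way to sketch it is with a multi-field mapping, where the same source field is indexed once with a general analyzer and once with camel_analyzer. The Python below only builds the "properties" fragment of such a mapping as a dict. The field name `message`, the sub-field name `camel` and the choice of the `standard` analyzer are assumptions for illustration, and the text field type depends on your Elasticsearch version ("string" in older releases, "text" in newer ones).

```python
import json

# Hypothetical field name; "camel_analyzer" is the analyzer from the settings above.
FIELD_NAME = "message"


def build_properties(text_type="text"):
    """Return a 'properties' fragment that analyzes one field in two ways."""
    return {
        FIELD_NAME: {
            "type": text_type,
            "analyzer": "standard",  # general-purpose analysis
            "fields": {
                # Sub-field indexed with the camel-case aware analyzer.
                "camel": {
                    "type": text_type,
                    "analyzer": "camel_analyzer",
                }
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps({"properties": build_properties()}, indent=2))
```

A query could then search both message and message.camel, at the cost of indexing the text twice.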

Does this help?
