Negative lookahead regexp doesn't work in ES DSL query

Question

My Elasticsearch mapping looks like this:

{
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "node": {
      "properties": {
        "field1": {
          "type": "keyword"
        },
        "field2": {
          "type": "keyword"
        },
        "query": {
          "properties": {
            "regexp": {
              "properties": {
                "field1": {
                  "type": "keyword"
                },
                "field2": {
                  "type": "keyword"
                }
              }
            }
          }
        }
      }
    }
  }
}

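The problem:

I am building ES queries with elasticsearch_dsl Q(). The query works fine in most cases, even when it contains a complex regular expression, but it fails completely if the pattern contains the regexp character '!'. It returns no results at all when the search term contains '!'.

For example:

1.) Q('regexp', field1 = "^[a-z]{3}.b.*") (works perfectly)

2.) Q('regexp', field1 = "^f04.*") (works perfectly)

3.) Q('regexp', field1 = "f00.*") (works perfectly)

4.) Q('regexp', field1 = "f04baz?") (works perfectly)

It fails in the following case:

5.) Q('regexp', field1 = "f04((?!z).)*") (fails, no results at all)

I tried adding "analyzer":"keyword" along with "type":"keyword" to the fields, but nothing worked in that case either.

In the browser, I checked how analyzer:keyword handles the input in the failing case: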

http://localhost:9210/search/_analyze?analyzer=keyword&text=f04((?!z).)*

The result looks fine here:

{
  "tokens": [
    {
      "token": "f04((?!z).)*",
      "start_offset": 0,
      "end_offset": 12,
      "type": "word",
      "position": 0
    }
  ]
}

I run my query as follows:

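from elasticsearch_dsl import Q, Search

# _conn, _index, _type and logger are assumed to be defined elsewhere
search_obj = Search(using=_conn, index=_index, doc_type=_type).query(Q('regexp', field1="f04baz?"))
count = search_obj.count()
response = search_obj[0:count].execute()  # fetch every matching hit
logger.debug("total nodes(hits): " + str(response.hits.total))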

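Please help, this is a really annoying problem, because every regex character works fine in all queries except !.

Also, how can I check which analyzer is currently applied in my mapping with the settings above?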

Answer 1

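The Elasticsearch Lucene regex engine does not support any type of lookaround. The ES regex documentation is rather ambiguous, saying that matching everything like .* is very slow as well as using lookaround regular expressions (which is not only ambiguous but also wrong, since lookarounds, when used wisely, may greatly speed up regex matching).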
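
Because the query returns no results rather than anything more obvious, it can help to catch such patterns on the client side before they are sent. The following is only a rough heuristic sketch; `uses_lookaround` is a hypothetical helper name and is not part of elasticsearch_dsl.

```python
import re

# Heuristic: an unescaped "(?" followed by =, !, <= or <! opens a lookaround
# group in Python/PCRE-style regex syntax. It does not handle every escaping
# corner case (for example an escaped backslash right before the group).
_LOOKAROUND = re.compile(r"(?<!\\)\(\?<?[=!]")


def uses_lookaround(pattern: str) -> bool:
    """Return True if the pattern appears to contain a lookahead or lookbehind."""
    return _LOOKAROUND.search(pattern) is not None


# Patterns taken from the question.
for candidate in ["^[a-z]{3}.b.*", "f04baz?", "f04((?!z).)*"]:
    print(candidate, uses_lookaround(candidate))
```

A check like this could run right before the pattern is passed to Q('regexp', ...), so that an unsupported pattern is reported instead of silently matching nothing.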

Since you want to match any string that contains f04 and does not contain z, you may actually use

[^z]*f04[^z]*

Details

[^z]* - any 0+ chars other than z
f04 - the f04 substring
[^z]* - any 0+ chars other than z.
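
The behaviour of this pattern can be checked locally before running it against the index. The sketch below assumes that `re.fullmatch` is a fair stand-in for Lucene's whole-term matching for this simple pattern; `matches_whole_term` and `regexp_query` are hypothetical helper names, and the dict is the raw query body that Q('regexp', field1=PATTERN) is expected to correspond to.

```python
import re

PATTERN = "[^z]*f04[^z]*"


def matches_whole_term(pattern: str, term: str) -> bool:
    """Emulate whole-term matching: the pattern must cover the entire string."""
    return re.fullmatch(pattern, term) is not None


def regexp_query(field: str, pattern: str) -> dict:
    """Build a raw regexp query body for the given field and pattern."""
    return {"regexp": {field: pattern}}


# Sample terms are made up for illustration only.
samples = ["f04bar", "xf04", "f04baz", "zf04", "f03bar"]
results = {term: matches_whole_term(PATTERN, term) for term in samples}
print(results)
print(regexp_query("field1", PATTERN))
```

Only the terms that contain f04 and have no z anywhere pass the local check, which is the intent of the lookahead version in the question.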

If you need to "exclude" a multi-character string (say, z4 instead of z), you may use an approach based on the complement operator:

.*f04.*&~(.*z4.*)

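It means almost the same, but does not support line breaks:

.* - any chars other than newline, as many as possible
f04 - f04
.* - any chars other than newline, as many as possible
& - AND
~(.*z4.*) - any string other than one that contains z4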
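
Python's re module has no & or ~ operators, so the pattern cannot be run locally as written. The sketch below emulates its meaning with two separate checks and shows the long form of the regexp query body. It assumes an Elasticsearch version whose regexp `flags` parameter still enables the optional COMPLEMENT and INTERSECTION operators (the default value ALL does in the versions that support them); the helper names are hypothetical.

```python
import re


def contains_f04_but_not_z4(term: str) -> bool:
    """Local emulation of the Lucene pattern .*f04.*&~(.*z4.*)."""
    has_f04 = re.fullmatch(r".*f04.*", term) is not None
    has_z4 = re.fullmatch(r".*z4.*", term) is not None
    # "&" is intersection, "~" is complement: f04 present AND z4 absent.
    return has_f04 and not has_z4


def regexp_query_with_flags(field: str, pattern: str, flags: str = "ALL") -> dict:
    """Build the long-form regexp query body with an explicit flags value."""
    return {"regexp": {field: {"value": pattern, "flags": flags}}}


# Sample terms are made up for illustration only.
for term in ["f04bar", "f04z4", "f04z", "bar"]:
    print(term, contains_f04_but_not_z4(term))

print(regexp_query_with_flags("field1", ".*f04.*&~(.*z4.*)"))
```

With elasticsearch_dsl the same body would presumably be passed as Q('regexp', field1={"value": ..., "flags": "ALL"}); check the documentation for your ES version before relying on the complement operator.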

