How to use Elasticsearch to query emails from text with a regex

Question

I want to query all emails from text stored in ES. I am currently using the query below, and it does return a result:

{
"query": {
    "regexp": {
        "sys_content": {
            "value": "[-a-zA-Z0-9_]+(\.[-a-zA-Z0-9_]+)*@[-a-zA-Z0-9_]+(\.[-a-zA-Z0-9_]+)+",
            "flags_value": 65535,
            "max_determinized_states": 10000,
            "boost": 1.0
        }
    }
},
"highlight": {
    "pre_tags": [
        "<span style='color:red'>"
    ],
    "post_tags": [
        "</span>"
    ],
    "fragment_size": 100,
    "require_field_match": true,
    "fields": {
        "sys_content": {}
    }
}

}

Then I tried to query "\@", but got nothing.

Answer 1

Here is a solution that uses the uax_url_email tokenizer. It does most of the work at index time, which makes your searches faster.

Create the index with a custom analyzer that produces the tokens, plus a filter that keeps only the tokens of type <EMAIL>:

PUT test-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter": ["extract_email"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "uax_url_email",
          "max_token_length": 50
        }
      },
      "filter": {
        "extract_email": {
          "type": "keep_types",
          "types": [ "<EMAIL>" ]
        }
      }
    }
  },
  "mappings" : {
      "properties" : {
        "sys_content" : {
          "type" : "text",
          "fields": {
            "email": {
              "type": "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
}
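Once the index exists, the analyzer can be checked before indexing anything by sending sample text to the `_analyze` API. The following is a minimal Python sketch using only the standard library. It assumes an unsecured cluster at `http://localhost:9200` (an assumption, adjust as needed), and the helper names `build_analyze_request` and `analyzed_tokens` are made up for illustration.

```python
import json
import urllib.request

# Assumption: a local cluster without authentication or TLS.
ES_URL = "http://localhost:9200"


def build_analyze_request(index, field, text, base_url=ES_URL):
    """Build (but do not send) a POST to <index>/_analyze for one field.

    Passing "field" makes Elasticsearch use the analyzer configured
    for that field in the index mapping.
    """
    body = json.dumps({"field": field, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyzed_tokens(request):
    """Send the request and return the token strings from the response."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]


# Usage (needs a running cluster with the index above):
# req = build_analyze_request(
#     "test-index", "sys_content.email",
#     "test email@gmail.com not@ a@a email another@email.fr")
# print(analyzed_tokens(req))
```

If the analyzer is set up as intended, only the email addresses should come back as tokens for `sys_content.email`.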

Then add a document:

POST test-index/_doc
{
  "sys_content": "test email@gmail.com not@ a@a email another@email.fr"
}
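As a sanity check outside Elasticsearch, the pattern from the question can be run over the same sample text in Python. This is only a client-side sketch, not what the tokenizer does internally. It also illustrates a likely reason the original query is awkward: `regexp` is a term-level query, so the pattern has to match a whole indexed term, and with the default standard analyzer an address is not expected to survive as a single term. The term list `["email", "gmail.com"]` below is an assumption about that analyzer's output, and the function names are hypothetical.

```python
import re

# Pattern copied from the question. In a Python raw string the dot is
# escaped with a single backslash; groups are made non-capturing.
EMAIL_PATTERN = re.compile(
    r"[-a-zA-Z0-9_]+(?:\.[-a-zA-Z0-9_]+)*@[-a-zA-Z0-9_]+(?:\.[-a-zA-Z0-9_]+)+"
)


def find_emails(text):
    """Return every substring of `text` that matches the pattern."""
    return [match.group(0) for match in EMAIL_PATTERN.finditer(text)]


def matching_terms(terms):
    """Return the terms that match the pattern as a whole.

    This mimics how a term-level regexp query is applied: against each
    complete term, never against the original text.
    """
    return [term for term in terms if EMAIL_PATTERN.fullmatch(term)]


sample = "test email@gmail.com not@ a@a email another@email.fr"
print(find_emails(sample))
# Assumed standard-analyzer terms for "email@gmail.com": no term matches.
print(matching_terms(["email", "gmail.com"]))
# A single email token, as kept by the custom analyzer, does match.
print(matching_terms(["email@gmail.com"]))
```

On the sample text the pattern picks out `email@gmail.com` and `another@email.fr` and skips `not@` and `a@a`, which lines up with the tokens the custom analyzer is meant to keep.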

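Finally, search and highlight the emails. Thanks to the uax_url_email tokenizer, finding the emails has already been done at index time, so at search time you only need to match any token in the sys_content.email field: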

GET test-index/_search
{
  "query": {
    "regexp": {
      "sys_content.email": {
        "value": ".*",
        "flags": "ALL",
        "case_insensitive": true,
        "max_determinized_states": 10000,
        "rewrite": "constant_score"
      }
    }
  },
  "highlight": {
    "pre_tags": [
        "<span style='color:red'>"
    ],
    "post_tags": [
        "</span>"
    ],
    "fragment_size": 100,
    "require_field_match": true,
    "fields": {
        "sys_content.email": {}
    }
  }
}

This produces the following result:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test-index",
        "_type" : "_doc",
        "_id" : "GxSbM3oBJxdf7EzzH4jM",
        "_score" : 1.0,
        "_source" : {
          "sys_content" : "test email@gmail.com not@ a@a email another@email.fr"
        },
        "highlight" : {
          "sys_content.email" : [
            "test <span style='color:red'>email@gmail.com</span> not@ a@a email <span style='color:red'>another@email.fr</span>"
          ]
        }
      }
    ]
  }
}
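To turn a response like this into a plain list of addresses, the text between the highlight tags can be pulled out on the client. A minimal Python sketch follows, assuming the response has already been parsed into a dict and that the same `pre_tags` and `post_tags` as in the search request were used. The function name `highlighted_terms` is hypothetical.

```python
import re

# Tags as configured in the search request above.
PRE_TAG = "<span style='color:red'>"
POST_TAG = "</span>"


def highlighted_terms(response, field, pre_tag=PRE_TAG, post_tag=POST_TAG):
    """Collect the text between highlight tags for `field` across all hits."""
    pattern = re.compile(
        re.escape(pre_tag) + r"(.*?)" + re.escape(post_tag), re.DOTALL
    )
    found = []
    for hit in response["hits"]["hits"]:
        # Hits without a highlight for this field contribute nothing.
        for fragment in hit.get("highlight", {}).get(field, []):
            found.extend(pattern.findall(fragment))
    return found


response = {
    "hits": {
        "hits": [
            {
                "highlight": {
                    "sys_content.email": [
                        "test <span style='color:red'>email@gmail.com</span> "
                        "not@ a@a email "
                        "<span style='color:red'>another@email.fr</span>"
                    ]
                }
            }
        ]
    }
}
print(highlighted_terms(response, "sys_content.email"))
```

Keep in mind that highlighting returns a limited number of fragments per field, so for long documents this may not list every address unless the highlight settings are adjusted.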

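Note: there must be a better way to match any token in a field without resorting to a regexp search, but I could not find one. In any case this works, and the regex is very simple, so it should be fast.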

lucene

elasticsearch

elasticsearch-query