What is the best practice of fuzzy search (like '%aaa%' in MySQL) in Elasticsearch 6.8
Background: I'm using MySQL with millions of rows, each with about twenty columns. We have some complex searches, and some columns need fuzzy matching, e.g. username LIKE '%aaa%', which cannot use a MySQL index unless the leading % is removed. However, we need fuzzy matching for search (similar to the Stack Overflow search). I also looked at the MySQL fulltext index, but it does not support a complex search in a single SQL statement when other indexes are involved.
My solution: add Elasticsearch as our search engine, insert the data into both MySQL and ES, and search only in Elasticsearch.
I looked into fuzzy search in Elasticsearch. A wildcard query works, but many people advise against a leading *, because such searches can be very slow.
For example, with username 'John_Snow':
wildcard works but may be slow:
GET /user/_search
{
"query": {
"wildcard": {
"username": "*hn*"
}
}
}
match_phrase does not work; it seems to only apply to tokenized phrases like 'John Snow':
{
"query": {
"match_phrase":{
"dbName": "hn"
}
}
}
My question: is there a better solution for complex queries that include fuzzy matches such as '%no%' or '%hn_Sn%'?
You can use the ngram tokenizer, which first breaks text down into words whenever it encounters one of a list of specified characters, and then emits N-grams of each word of the specified length.
Adding a working example with index data, mapping, search queries, and results.
Index mapping (note that in Elasticsearch 6.8 the field mappings must be nested under the document type, _doc here):
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 50
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
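Assuming the index is named test (the name that appears in the search hit further down), the JSON above is sent as the body of PUT /test when creating the index. The stored analysis settings and field mapping can then be verified with:
GET /test/_settings
GET /test/_mapping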
Analyze API (the custom analyzer is defined in the index settings, so the request goes through the index):
POST /test/_analyze
{
"analyzer": "my_analyzer",
"text": "John_Snow"
}
The tokens are:
{
"tokens": [
{
"token": "Jo",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "Joh",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "John",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "oh",
"start_offset": 1,
"end_offset": 3,
"type": "word",
"position": 3
},
{
"token": "ohn",
"start_offset": 1,
"end_offset": 4,
"type": "word",
"position": 4
},
{
"token": "hn",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 5
},
{
"token": "Sn",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 6
},
{
"token": "Sno",
"start_offset": 5,
"end_offset": 8,
"type": "word",
"position": 7
},
{
"token": "Snow",
"start_offset": 5,
"end_offset": 9,
"type": "word",
"position": 8
},
{
"token": "no",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 9
},
{
"token": "now",
"start_offset": 6,
"end_offset": 9,
"type": "word",
"position": 10
},
{
"token": "ow",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 11
}
]
}
Index data:
{
"title":"John_Snow"
}
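For completeness, the document can be indexed with a single-document request; the index name test, the _doc type, and the id 1 below are taken from the search hit shown next:
PUT /test/_doc/1
{
  "title": "John_Snow"
}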
Search query (GET /test/_search):
{
"query": {
"match" : {
"title" : "hn"
}
}
}
Search result (the standard search analyzer keeps "hn" as a single token, which matches the "hn" 2-gram emitted at index time):
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"title": "John_Snow"
}
}
]
Another search query:
{
"query": {
"match" : {
"title" : "ohr"
}
}
}
The above search query returns no results, because "ohr" is not one of the N-grams emitted from "John_Snow" (see the token list above).
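Since the original requirement is a complex query that combines fuzzy matching with other conditions, the ngram-analyzed field can be used as one clause of a bool query like any other field. A minimal sketch, assuming the same test index and a purely hypothetical status field for illustration:
GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "hn" } }
      ],
      "filter": [
        { "term": { "status": "active" } }
      ]
    }
  }
}
Any additional conditions on other columns would go into the filter (or must) array alongside the match clause on the ngram field.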