What is the best practice of fuzzy search (like '%aaa%' in MySQL) in Elasticsearch 6.8
Background: I'm using MySQL with millions of rows, each with about twenty columns. We have some complex searches, and some columns need fuzzy matching, e.g. username LIKE '%aaa%', which cannot use a MySQL index unless the leading % is removed. However, we need fuzzy matching for search (similar to the Stack Overflow search). I also looked at the MySQL fulltext index, but it does not support a complex search in a single SQL statement when other indexes are involved.
My solution: add Elasticsearch as our search engine, insert the data into both MySQL and ES, and search only in Elasticsearch.
I looked into fuzzy search in Elasticsearch. A wildcard query works, but many people advise against a leading *, because such searches can be very slow.
For example, with username 'John_Snow':
wildcard works but may be slow:
GET /user/_search
{
"query": {
"wildcard": {
"username": "*hn*"
}
}
}
match_phrase does not work; it seems to only apply to tokenized phrases like 'John Snow':
{
"query": {
"match_phrase":{
"dbName": "hn"
}
}
}
My question: is there a better solution for complex queries that include fuzzy matches such as '%no%' or '%hn_Sn%'?
You can use the ngram tokenizer, which first breaks text down into words whenever it encounters one of a list of specified characters, and then emits N-grams of each word of the specified length.
Adding a working example with index data, mapping, search queries, and results.
Index mapping (note that in Elasticsearch 6.8 the field mappings must be nested under the document type, _doc here):
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 50
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
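Assuming the index is named test (the name that appears in the search hit further down), the JSON above is sent as the body of PUT /test when creating the index. The stored analysis settings and field mapping can then be verified with:
GET /test/_settings
GET /test/_mapping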
Analyze API (the custom analyzer is defined in the index settings, so the request goes through the index):
POST /test/_analyze
{
"analyzer": "my_analyzer",
"text": "John_Snow"
}
The tokens are:
{
"tokens": [
{
"token": "Jo",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "Joh",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "John",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "oh",
"start_offset": 1,
"end_offset": 3,
"type": "word",
"position": 3
},
{
"token": "ohn",
"start_offset": 1,
"end_offset": 4,
"type": "word",
"position": 4
},
{
"token": "hn",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 5
},
{
"token": "Sn",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 6
},
{
"token": "Sno",
"start_offset": 5,
"end_offset": 8,
"type": "word",
"position": 7
},
{
"token": "Snow",
"start_offset": 5,
"end_offset": 9,
"type": "word",
"position": 8
},
{
"token": "no",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 9
},
{
"token": "now",
"start_offset": 6,
"end_offset": 9,
"type": "word",
"position": 10
},
{
"token": "ow",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 11
}
]
}
Index data:
{
"title":"John_Snow"
}
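For completeness, the document can be indexed with a single-document request; the index name test, the _doc type, and the id 1 below are taken from the search hit shown next:
PUT /test/_doc/1
{
  "title": "John_Snow"
}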
Search query (GET /test/_search):
{
"query": {
"match" : {
"title" : "hn"
}
}
}
Search result (the standard search analyzer keeps "hn" as a single token, which matches the "hn" 2-gram emitted at index time):
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"title": "John_Snow"
}
}
]
Another search query:
{
"query": {
"match" : {
"title" : "ohr"
}
}
}
The above search query returns no results, because "ohr" is not one of the N-grams emitted from "John_Snow" (see the token list above).
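Since the original requirement is a complex query that combines fuzzy matching with other conditions, the ngram-analyzed field can be used as one clause of a bool query like any other field. A minimal sketch, assuming the same test index and a purely hypothetical status field for illustration:
GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "hn" } }
      ],
      "filter": [
        { "term": { "status": "active" } }
      ]
    }
  }
}
Any additional conditions on other columns would go into the filter (or must) array alongside the match clause on the ngram field.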