Flexible search for users by full name in Elasticsearch

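I need to provide flexible full name search that meets the following requirements:

  1. It is possible to search by first name
  2. It is possible to search by last name
  3. It is possible to search by first name and last name, in either order
  4. It is possible to search by part of a first name or last name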

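As input I only have a string, so it does not matter whether it is a first name or a last name. So I decided to use the edge ngram tokenizer, and I also want to support searching with diacritics.

I have the following index: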

DELETE test.full.name

PUT test.full.name

{
    "settings": {
        "index": {
            "number_of_shards": "1",
            "analysis": {
                "filter": {
                    "edge_ngram_tokenizer": {
                        "token_chars": [
                            "letter",
                            "digit"
                        ],
                        "min_gram": "3",
                        "type": "edge_ngram",
                        "max_gram": "3"
                    }
                },
                "analyzer": {
                    "edge_ngram_multi_lang": {
                        "filter": [
                            "lowercase",
                            "german_normalization",
                            "edge_ngram_tokenizer"
                        ],
                        "tokenizer": "standard"
                    }
                }
            },
            "number_of_replicas": "1"
        }
    },
    "mappings": {
      "properties": {
        "fullName": {
          "type": "text",
          "analyzer": "edge_ngram_multi_lang"
        }
      }
  }
}

I also created some documents with data:

POST test.full.name/_doc
{
    "fullName": "Ruslan test"
}

POST test.full.name/_doc
{
    "fullName": "Russell test"
}

POST test.full.name/_doc
{
    "fullName": "Rust test"
}

The search query is:

GET test.full.name/_search
{
    "query": {
        "bool": {
            "must": [
                {
                    "bool": {
                        "should": [
                            {
                                "match": {
                                    "fullName": {
                                        "query": "ruslan",
                                        "operator": "and"
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}

It returns all three documents, but it should return only the document that contains the value ruslan.

The next search query:

GET test.full.name/_search
{
    "query": {
        "bool": {
            "must": [
                {
                    "bool": {
                        "should": [
                            {
                                "match": {
                                    "fullName": {
                                        "query": "ruslan test",
                                        "operator": "and"
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}

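It also returns all three documents, but only the document with "ruslan test" is expected. In addition, it should be possible to find a user by full name with the query terms in any order, and of course partial search should work as well: searching for "rus" should return all documents whose fullName contains such a value.

Likewise, the query "Ruslan test" should return documents with either "test ruslan" or "ruslan test", and the same goes for the query "test ruslan".

So how should the index be configured to meet the requirements above?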

You are using an edge_ngram token filter (named edge_ngram_tokenizer in your settings) which, according to your index settings, produces edge n-grams with a minimum length of 3 and a maximum length of 3. You can test this with the Analyze API:

GET test.full.name/_analyze
{
  "analyzer" : "edge_ngram_multi_lang",
  "text" : "Ruslan test"
}

The generated tokens are:

{
    "tokens": [
        {
            "token": "rus",
            "start_offset": 0,
            "end_offset": 6,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "tes",
            "start_offset": 7,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

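Every word is reduced to its first three characters, so "Ruslan", "Russell" and "Rust" are all indexed as "rus", and the query "ruslan" is also analyzed to "rus". That is why all three documents match. Since this is not what you need, you should use a Shingle token filter instead of Edge-ngram.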
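To make the collapse to "rus" concrete, here is a minimal Python sketch of the analysis chain. It is only an approximation and not Elasticsearch code: the helper names edge_ngrams and analyze are made up for illustration, whitespace splitting stands in for the standard tokenizer, and german_normalization is left out.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Return the prefixes of token with length min_gram..max_gram.

    Tokens shorter than min_gram produce no output at all.
    """
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text, min_gram=3, max_gram=3):
    # Rough stand-in for: standard tokenizer -> lowercase -> edge_ngram filter.
    # Assumption: splitting on whitespace is close enough for these sample names.
    tokens = []
    for word in text.lower().split():
        tokens.extend(edge_ngrams(word, min_gram, max_gram))
    return tokens


if __name__ == "__main__":
    for name in ["Ruslan test", "Russell test", "Rust test", "ruslan"]:
        print(name, "->", analyze(name))
```

With min_gram and max_gram both set to 3, the three sample documents and the query all end up with the same "rus" token, so the index cannot tell them apart.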


Adding a working example with index mapping, search query, and search result

Index mapping:

{
    "settings": {
        "index": {
            "number_of_shards": "1",
            "analysis": {
                "filter": {
                    "my_shingle_filter": {
                        "type": "shingle",
                        "min_shingle_size": 2,
                        "max_shingle_size": 3
                    }
                },
                "analyzer": {
                    "edge_ngram_multi_lang": {
                        "filter": [
                            "lowercase",
                            "german_normalization",
                            "my_shingle_filter"
                        ],
                        "tokenizer": "standard"
                    }
                }
            },
            "number_of_replicas": "1"
        }
    },
    "mappings": {
        "properties": {
            "fullName": {
                "type": "text",
                "analyzer": "edge_ngram_multi_lang"
            }
        }
    }
}

The generated tokens will now be:

{
    "tokens": [
        {
            "token": "ruslan",
            "start_offset": 0,
            "end_offset": 6,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "ruslan test",
            "start_offset": 0,
            "end_offset": 11,
            "type": "shingle",
            "position": 0,
            "positionLength": 2
        },
        {
            "token": "test",
            "start_offset": 7,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}
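As a rough illustration of where these tokens come from, the following Python sketch mimics a shingle filter with min_shingle_size 2 and max_shingle_size 3 that also emits the single words. The function name shingles is made up for this example, and whitespace splitting plus lowercasing is an assumed simplification of the real analyzer.

```python
def shingles(words, min_size=2, max_size=3, output_unigrams=True):
    """Return word n-grams (shingles) in the order they start in the input."""
    out = []
    for i in range(len(words)):
        if output_unigrams:
            out.append(words[i])
        for size in range(min_size, max_size + 1):
            # Only emit a shingle if enough words remain from position i.
            if i + size <= len(words):
                out.append(" ".join(words[i:i + size]))
    return out


if __name__ == "__main__":
    print(shingles("Ruslan test".lower().split()))
    print(shingles("test Ruslan".lower().split()))
```

Unlike the three-character edge n-grams, the whole words "ruslan" and "test" are kept, so "Russell test" and "Rust test" no longer share a token with the query "ruslan".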

Search query:

{
    "query": {
        "bool": {
            "must": [
                {
                    "bool": {
                        "should": [
                            {
                                "match": {
                                    "fullName": {
                                        "query": "test Ruslan",
                                        "operator": "and"
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}

Search result:

"hits": [
            {
                "_index": "my-idx",
                "_id": "4",
                "_score": 0.9150312,
                "_source": {
                    "fullName": "test Ruslan"
                }
            },
            {
                "_index": "my-idx",
                "_id": "1",
                "_score": 0.88840073,
                "_source": {
                    "fullName": "Ruslan test"
                }
            }
        ]

Update 1:

If partial search is also a requirement, then you should go for the Search-as-you-type field type.
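For reference, a minimal sketch of what a search_as_you_type mapping could look like, written as a Python dict that can be serialized into the request body. This mapping is not part of the original answer. With the default max_shingle_size of 3, this field type creates the fullName._2gram and fullName._3gram subfields automatically. The variable name index_body is only illustrative.

```python
import json

# Assumed mapping: fullName as a search_as_you_type field with default settings.
index_body = {
    "mappings": {
        "properties": {
            "fullName": {
                "type": "search_as_you_type"
            }
        }
    }
}

# Serialize to the JSON that would be sent as the body of PUT test.full.name
request_json = json.dumps(index_body, indent=2)

if __name__ == "__main__":
    print(request_json)
```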

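But you can also use the same index mapping and settings defined in the answer above (since we are already using shingles). You only need to change the search query to: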

{
    "query": {
        "bool": {
            "must": [
                {
                    "bool": {
                        "should": [
                            {
                                "multi_match": {
                                    "query": "rusl",
                                    "type": "bool_prefix",
                                    "fields": [
                                        "fullName",
                                        "fullName._2gram",
                                        "fullName._3gram"
                                    ],
                                    "operator": "AND"
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}
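The idea behind bool_prefix with operator AND can be sketched in a few lines of Python: every query term except the last must match a whole token, and the last term only has to be a prefix of some token. This is a simplified model and not how Elasticsearch is implemented. It looks at a single field, ignores shingle tokens and scoring, and the names tokenize, bool_prefix_matches and search are made up for illustration.

```python
def tokenize(text):
    # Simplified analysis: lowercase and split on whitespace.
    return text.lower().split()


def bool_prefix_matches(query, field_value):
    """Model of bool_prefix with operator AND on one field."""
    query_terms = tokenize(query)
    if not query_terms:
        return False
    doc_terms = tokenize(field_value)
    *exact_terms, last_term = query_terms
    # Every term except the last must appear as a complete token.
    if not all(term in doc_terms for term in exact_terms):
        return False
    # The last term only needs to be a prefix of some token.
    return any(token.startswith(last_term) for token in doc_terms)


DOCS = ["Ruslan test", "Russell test", "Rust test"]


def search(query):
    return [doc for doc in DOCS if bool_prefix_matches(query, doc)]


if __name__ == "__main__":
    for q in ["rusl", "rus", "test rusl", "ruslan te"]:
        print(q, "->", search(q))
```

In this model "rus" still finds all three sample documents, while "rusl" or "test rusl" narrows the result to "Ruslan test", which is the behaviour the question asks for.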

The index mapping and settings above cover all the test scenarios in the question.