ElasticSearch query optimization - Java API

I'm new to ES and I'm searching a record set of 100k entries. Here are the mapping and settings JSON with which I indexed my data:

settings.json

{
    "index": {
        "analysis": {
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 10
                }
            },
            "analyzer": {
                "ngram_tokenizer_analyzer": {
                    "type": "custom",
                    "tokenizer": "ngram_tokenizer"
                }
            }
        }
    }
}

mappings.json

{
    "product": {
        "properties": {
            "name": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "description": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "vendorModelNumber": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "brand": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "specifications": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "upc": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "storeSkuId": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            },
            "modelNumber": {
                "type": "string",
                "analyzer": "ngram_tokenizer_analyzer",
                "store": true
            }
        }
    }
}

I need to query documents against all of the fields mentioned above, with certain priorities. This is my query to search all records:

// Fields in priority order; the boost decreases from 7 (name) down to 1 (brand).
String[] fields = { "name", "description", "modelNumber", "vendorModelNumber",
        "storeSkuId", "upc", "brand" };

BoolQueryBuilder query = QueryBuilders.boolQuery();
int boost = 7;
for (String field : fields) {
    for (String str : dataSplit) {
        query.should(QueryBuilders.wildcardQuery(field, "*" + str.toLowerCase() + "*").boost(boost));
    }
    boost--;
}
client.prepareSearch(index).setQuery(query).setSize(200).setExplain(true).execute().actionGet();

The query does help me search the data and works fine, but my problem is that it takes a lot of time, since I'm using wildcard queries. Can someone help me optimize this query, or point me to a query better suited to my search? TIA.

First off, let me answer the simple question first: handling case sensitivity. If you define a custom analyzer, you can add different filters, which are applied to each token after the input has been processed by the tokenizer.

{
    "index": {
        "analysis": {
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 10
                }
            },
            "analyzer": {
                "ngram_tokenizer_analyzer": {
                    "type": "custom",
                    "tokenizer": "ngram_tokenizer",
                    "filter": [
                        "lowercase",
                        ...
                    ]
                }
            }
        }
    }
}

As you see, there is an existing lowercase filter, which will simply transform all tokens to lower case. I strongly recommend referring to the documentation. There are a lot of these token filters.
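
Conceptually, a custom analyzer is just the tokenizer's output piped through each filter in order. A plain-Java sketch (not the ES API) of what the lowercase filter does to the tokens:

```java
import java.util.List;
import java.util.stream.Collectors;

public class LowercaseFilterDemo {
    public static void main(String[] args) {
        // Tokens as they come out of the ngram tokenizer...
        List<String> tokens = List.of("TEX", "EXT", "Exa");
        // ...then each configured token filter is applied in order;
        // here just "lowercase".
        List<String> lowered = tokens.stream()
                .map(String::toLowerCase)
                .collect(Collectors.toList());
        System.out.println(lowered); // [tex, ext, exa]
    }
}
```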


Now the more complicated part: the ngram tokenizer. Again, for a deeper understanding, you might want to read the docs. But regarding your problem, your tokenizer will essentially create terms of length 3 to 10, which means the text

I am an example TEXT.

Will basically create a lot of tokens. Just to show a few:

  • Size 3: "I a", " am", "am ", ..., "TEX", "EXT"
  • Size 4: "I am", " am ", "am a", ..., " TEX", "TEXT".
  • Size 10: "I am an ex", ...

You get the idea. (The lowercase token filter would lowercase these tokens now)
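
To get a feel for how many terms this produces, here is a rough plain-Java approximation of a 3-to-10 ngram tokenizer (the real tokenizer also splits on configured character classes; this sketch just slides windows over the raw text):

```java
import java.util.ArrayList;
import java.util.List;

public class NgramDemo {
    // Rough approximation of an ngram tokenizer with min_gram=3, max_gram=10:
    // every sliding-window substring of length 3..10.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int len = minGram; len <= maxGram; len++) {
            for (int i = 0; i + len <= text.length(); i++) {
                out.add(text.substring(i, i + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = ngrams("I am an example TEXT.", 3, 10);
        // A 21-character sentence already yields 124 terms.
        System.out.println(tokens.size()); // 124
    }
}
```

Multiply that by 100k documents and it becomes clear why the index holds a huge number of terms.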

Difference between Match and Term Query: Match queries are analyzed, while term queries are not. In fact, that means your match queries can match multiple terms. Example: you match "exam".

This would match 3 terms in fact: exa, xam and exam.

This has influence on the score of the matches. The more matches, the higher the score. In some cases it's desired, in other cases not.
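
Sticking with the same toy ngram logic (plain Java, not the ES API): analyzing the match input "exam" with the 3-to-10 ngram analyzer yields exactly those three terms, and all three exist in the index for a document containing "example":

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MatchTermCountDemo {
    // Same rough 3..10 sliding-window approximation as the analyzer above.
    static List<String> ngrams(String text) {
        List<String> out = new ArrayList<>();
        for (int len = 3; len <= 10; len++) {
            for (int i = 0; i + len <= text.length(); i++) {
                out.add(text.substring(i, i + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Terms indexed for a document whose field value is "example".
        Set<String> indexTerms = new HashSet<>(ngrams("example"));
        // A match query analyzes its input with the same analyzer...
        List<String> queryTerms = ngrams("exam"); // [exa, xam, exam]
        // ...and every resulting term found in the index contributes
        // to the score. Here: all three of them.
        long hits = queryTerms.stream().filter(indexTerms::contains).count();
        System.out.println(queryTerms + " -> " + hits); // [exa, xam, exam] -> 3
    }
}
```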

A term query is not analyzed, which means "exam" would match, but only one term (exam, of course). However, since it's not analyzed, it's also not lowercased, meaning you have to do that in code yourself. "Exam" would never match, because there is no term with capital letters in your index if you use the lowercase token filter.
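
A toy demonstration of that pitfall (plain Java, with a hand-picked set standing in for the indexed terms): the term query string is used verbatim, so the lower-casing has to happen in your own code:

```java
import java.util.Set;

public class TermCaseDemo {
    public static void main(String[] args) {
        // A few of the lowercased ngram terms indexed for "example".
        Set<String> indexTerms = Set.of("exa", "xam", "exam", "ample");
        String userInput = "Exam";
        // A term query does no analysis: "Exam" is looked up as-is and misses.
        System.out.println(indexTerms.contains(userInput));               // false
        // Lower-casing in code first makes it hit the indexed term.
        System.out.println(indexTerms.contains(userInput.toLowerCase())); // true
    }
}
```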

Not sure about your use-case. But I have a feeling that you could (or even want to) indeed use the term query. But be aware: there are no terms in your index with a size bigger than 10, because that's what your ngram tokenizer produces.

/ EDIT:

Something worth pointing out regarding match queries, and a reason why you might want to use term queries instead: a match query for "simple" would also match a document containing "example", because the analyzed input shares terms such as "mple" with it.