Elasticsearch - search wildcards (contains in strings) and tf-idf scores
How can I run a wildcard search and still get tf-idf scores?
For example, when I search like this,
GET /test_es/_search?explain=true // returns idf / tf scores
{
"explain":true,
"query": {
"query_string": {
"query": "bar^5",
"fields" : ["field"]
}
}
}
it returns the idf and tf scores,
but not when I search with a wildcard (contains):
GET /test_es/_search?explain=true // does NOT return idf / tf scores
{
"explain":true,
"query": {
"query_string": {
"query": "b*",
"fields" : ["field"]
}
}
}
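How can I search with a wildcard (a "contains" match inside a string) and still get the tf-idf scores?
For example, I have 3 documents:
"foo", "foo bar", "foo baz"
When I search like this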
GET /foo2/_search?explain=true
{
"explain":true,
"query": {
"query_string": {
"query": "fo *",
"fields" : ["field"]
}
}
}
Elasticsearch result:
"hits" : [
{
"_shard" : "[foo2][0]",
"_node" : "z8bjI0T1T8Oq6Z2OwFyIKw",
"_index" : "foo2",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"field" : "foo bar"
},
"_explanation" : {
"value" : 1.0,
"description" : "sum of:",
"details" : [
{
"value" : 1.0,
"description" : "*:*",
"details" : [ ]
}
]
}
},
{
"_shard" : "[foo2][0]",
"_node" : "z8bjI0T1T8Oq6Z2OwFyIKw",
"_index" : "foo2",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"field" : "foo"
},
"_explanation" : {
"value" : 1.0,
"description" : "sum of:",
"details" : [
{
"value" : 1.0,
"description" : "*:*",
"details" : [ ]
}
]
}
},
{
"_shard" : "[foo2][0]",
"_node" : "z8bjI0T1T8Oq6Z2OwFyIKw",
"_index" : "foo2",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"field" : "foo baz"
},
"_explanation" : {
"value" : 1.0,
"description" : "sum of:",
"details" : [
{
"value" : 1.0,
"description" : "*:*",
"details" : [ ]
}
]
}
}
]
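But I think "foo" should be the first result with the highest score, because it matches 100%. Am I wrong?
Since you have not mentioned anything about the data you are indexing, I indexed the following data:
Index data: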
{
"message": "A fox is a wild animal."
}
{
"message": "That fox must have killed the hen."
}
{
"message": "the quick brown fox jumps over the lazy dog"
}
Search query:
GET /{{index-name}}/_search?explain=true
{
"query": {
"query_string": {
"fields": [
"message" // You can add more fields here
],
"query": "quick^2 fox*"
}
}
}
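The query above searches for all documents that contain fox, but because a boost is applied to quick, the document that also contains quick gets a higher score than the other documents.
This query returns tf-idf scores.
The boost operator makes one term more relevant than another.
To learn more, see "Boosting" in the official dsl-query-string documentation.
To learn more about the tf-idf algorithm, you can refer to this blog.
If you want to search across multiple fields, you can also boost the score of a particular field.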
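As a rough illustration of boosting a field, the sketch below builds a query_string request body in Python and prints it as JSON. The helper name build_query and the extra "title" field are hypothetical and only serve the example; the field^N suffix in "fields" is the standard query_string way to weight one field above another.

```python
import json


def build_query(text, fields):
    """Build a query_string search body.

    Each entry in `fields` may carry a ^boost suffix, e.g. "title^3".
    """
    return {
        "query": {
            "query_string": {
                "query": text,
                "fields": fields,
            }
        }
    }


# "title" is a hypothetical second field; matches in it count 3x as much.
body = build_query("quick^2 fox*", ["title^3", "message"])
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a GET /{{index-name}}/_search?explain=true request.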
Update 1:
Index data:
{
"title": "foo bar"
}
{
"title": "foo baz"
}
{
"title": "foo"
}
Search query:
{
"query": {
"query_string": {
"query": "foo *" // You can just add a space between foo and *
}
}
}
Search result:
"hits": [
{
"_index": "foo2",
"_type": "_doc",
"_id": "1",
"_score": 1.9808292, // foo matches exactly, so the score is maximum
"_source": {
"title": "foo"
}
},
{
"_index": "foo2",
"_type": "_doc",
"_id": "2",
"_score": 1.1234324,
"_source": {
"title": "foo bar"
}
},
{
"_index": "foo2",
"_type": "_doc",
"_id": "3",
"_score": 1.1234324,
"_source": {
"title": "foo baz"
}
}
]
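Update 2:
Wildcard queries are term-level queries, and by default they use the constant_score rewrite method, which falls back to constant_score_boolean when only a few terms match. Either way, every matching document gets the same constant score, which is why the explanation shows no tf-idf details.
By changing the value of the rewrite parameter, you can affect both search performance and relevance. It offers several scoring options, and you can choose whichever fits your needs.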
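A minimal sketch of the rewrite idea, assuming the documents live in a field called "field" as in the question: the hypothetical helper wildcard_with_scoring builds a wildcard query whose rewrite is set to scoring_boolean, one of the documented rewrite values that computes a relevance score per matching term instead of a constant one. With many matching terms this mode can exceed the boolean clause limit, so treat it as something to try rather than a guaranteed fix.

```python
import json


def wildcard_with_scoring(field, pattern, rewrite="scoring_boolean"):
    """Build a wildcard search body that asks for scored (not constant) matches."""
    return {
        "explain": True,  # include the score explanation in each hit
        "query": {
            "wildcard": {
                field: {
                    "value": pattern,
                    "rewrite": rewrite,
                }
            }
        },
    }


body = wildcard_with_scoring("field", "b*")
print(json.dumps(body, indent=2))
```

Other documented values such as top_terms_N (for example top_terms_10) can be passed through the rewrite argument in the same way.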
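Depending on your use case, you can also use the edge_ngram tokenizer.
Edge n-grams are useful for search-as-you-type queries. To learn more, and for the mapping used below, refer to the official documentation.
Index mapping: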
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
},
"autocomplete_search": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
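To see why a plain match query for "fo" can find these titles, it helps to look at the tokens this analyzer would produce at index time. The Python function below is only an approximation of the edge_ngram tokenizer with min_gram 2, max_gram 10 and token_chars "letter" plus the lowercase filter; it is not Elasticsearch code, and the name edge_ngrams is made up for this sketch.

```python
import re


def edge_ngrams(text, min_gram=2, max_gram=10):
    """Approximate the 'autocomplete' analyzer: split on non-letters,
    lowercase, and emit every prefix from min_gram to max_gram characters."""
    tokens = []
    # [^\W\d_]+ matches runs of letters only (no digits, no underscore).
    for word in re.findall(r"[^\W\d_]+", text):
        word = word.lower()
        longest = min(max_gram, len(word))
        for n in range(min_gram, longest + 1):
            tokens.append(word[:n])
    return tokens


for title in ["foo", "foo bar", "foo baz"]:
    print(title, "->", edge_ngrams(title))
```

Every title yields the token "fo", so the search term, which the autocomplete_search analyzer leaves as the single token "fo", matches all three documents through an ordinary scored term lookup.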
Index sample data:
{ "title":"foo" }
{ "title":"foo bar" }
{ "title":"foo baz" }
Search query:
{
"query": {
"match": {
"title": {
"query": "fo"
}
}
}
}
Search result:
"hits": [
{
"_index": "foo6",
"_type": "_doc",
"_id": "1",
"_score": 0.15965709, // Maximum score
"_source": {
"title": "foo"
}
},
{
"_index": "foo6",
"_type": "_doc",
"_id": "2",
"_score": 0.12343237,
"_source": {
"title": "foo bar"
}
},
{
"_index": "foo6",
"_type": "_doc",
"_id": "3",
"_score": 0.12343237,
"_source": {
"title": "foo baz"
}
}
]
To learn more about the basics of using n-grams in Elasticsearch, you can refer to this