How to build an Elasticsearch query that will take into account the distance between words?
I am running elasticsearch:7.6.2
I have an index with 4 simple documents:
PUT demo_idx/_doc/1
{
  "content": "Distributed nature, simple REST APIs, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, the end"
}
PUT demo_idx/_doc/2
{
  "content": "Distributed tmp nature, simple REST APIs, speed, and scalability"
}
PUT demo_idx/_doc/3
{
  "content": "Distributed nature, simple REST APIs, speed, and scalability"
}
PUT demo_idx/_doc/4
{
  "content": "Distributed tmp tmp nature"
}
I want to search for the text: distributed nature
and get the results in the following order:
Doc id: 3
Doc id: 1
Doc id: 2
Doc id: 4
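In other words, documents with an exact match (doc 3 and doc 1) should appear before documents that match with a small slop (doc 2), and documents that match only with a large slop (doc 4) should appear last.
I read this post: How to build an Elasticsearch query that will take into account the distance between words and the exactitude of the word, but it didn't help me.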
I tried the following search query:
"query": {
  "bool": {
    "must": [
      {
        "match_phrase": {
          "content": {
            "query": query,
            "slop": 2
          }
        }
      }
    ]
  }
}
But it doesn't give me the required results.
I got the following results:
Doc id: 3 ,Score: 0.22949813
Doc id: 4 ,Score: 0.15556586
Doc id: 1 ,Score: 0.15401536
Doc id: 2 ,Score: 0.14397088
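How can I write a query to get the results I want?
You can use a bool should clause to rank documents that exactly match "Distributed nature" higher. The first clause boosts the score of documents that match "Distributed nature" exactly, without any slop.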
POST demo_idx/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "content": {
              "query": "Distributed nature"
            }
          }
        },
        {
          "match_phrase": {
            "content": {
              "query": "Distributed nature",
              "slop": 2
            }
          }
        }
      ]
    }
  }
}
The search response will be:
"hits" : [
  {
    "_index" : "demo_idx",
    "_type" : "_doc",
    "_id" : "3",
    "_score" : 0.45899627,
    "_source" : {
      "content" : "Distributed nature, simple REST APIs, speed, and scalability"
    }
  },
  {
    "_index" : "demo_idx",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.30803072,
    "_source" : {
      "content" : "Distributed nature, simple REST APIs, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, the end"
    }
  },
  {
    "_index" : "demo_idx",
    "_type" : "_doc",
    "_id" : "4",
    "_score" : 0.15556586,
    "_source" : {
      "content" : "Distributed tmp tmp nature"
    }
  },
  {
    "_index" : "demo_idx",
    "_type" : "_doc",
    "_id" : "2",
    "_score" : 0.14397088,
    "_source" : {
      "content" : "Distributed tmp nature, simple REST APIs, speed, and scalability"
    }
  }
]
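The question passes the search text in a variable named query, which suggests the request body is built in application code. As a minimal sketch under that assumption, the bool/should request above can be produced by a small helper; the function name build_proximity_query and its parameters are hypothetical, and the returned dictionary would be handed to whichever Elasticsearch client you use.

```python
import json


def build_proximity_query(text, slop=2, field="content"):
    """Return a bool/should body: one exact phrase clause plus one sloppy clause."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Exact phrase (no slop given): adds score when terms are adjacent.
                    {"match_phrase": {field: {"query": text}}},
                    # Sloppy phrase: still matches when up to `slop` moves are needed.
                    {"match_phrase": {field: {"query": text, "slop": slop}}},
                ]
            }
        }
    }


body = build_proximity_query("Distributed nature")
print(json.dumps(body, indent=2))
```

Because both clauses sit under should, a document that matches the exact phrase is scored by both clauses, while a document that matches only within the slop is scored by the second clause alone.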
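Update 1:
To avoid the effect of the "field length" parameter on search query scoring, you need to disable the "norms" parameter for the "content" field, using the Update Mapping API.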
PUT demo_idx/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "norms": false
    }
  }
}
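After this, reindex the documents again, because norms are not removed immediately.
Now run the search query again; the search response will be in the order you expect.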
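To see why field length matters here, the ranking can be approximated outside Elasticsearch. The sketch below is a simplified model, not the actual scoring code: it assumes the default BM25 parameters (k1 = 1.2, b = 0.75), token counts as produced by the standard analyzer, and a sloppy phrase frequency of 1 / (1 + distance). It drops the IDF and constant factors because they are the same for all four documents. The names DOCS, phrase_freq, tf_score and rank are illustrative. With norms off, the model assumes the length factor becomes the same constant for every document (1.0 here; the constant Elasticsearch actually uses may differ, which changes the scores but not the order).

```python
# Assumed BM25 defaults; not read from any cluster.
K1, B = 1.2, 0.75

# doc id -> (field length in tokens, moves needed to match "distributed nature")
DOCS = {"1": (19, 0), "2": (9, 1), "3": (8, 0), "4": (4, 2)}
AVG_LEN = sum(length for length, _ in DOCS.values()) / len(DOCS)


def phrase_freq(distance, slop):
    """Sloppy phrase frequency: 1 / (1 + distance), or 0 if outside the slop."""
    return 1.0 / (1.0 + distance) if distance <= slop else 0.0


def tf_score(freq, length, norms=True):
    """BM25 term-frequency part; with norms off, field length is ignored."""
    if freq == 0.0:
        return 0.0
    length_factor = (1 - B + B * length / AVG_LEN) if norms else 1.0
    return freq / (freq + K1 * length_factor)


def rank(slops, norms=True):
    """Sum one match_phrase clause per slop value, as a bool/should query does."""
    scores = {
        doc_id: sum(tf_score(phrase_freq(dist, s), length, norms) for s in slops)
        for doc_id, (length, dist) in DOCS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


print(rank([2]))                  # single sloppy clause, norms on
print(rank([0, 2]))               # exact + sloppy clauses, norms on
print(rank([0, 2], norms=False))  # exact + sloppy clauses, norms off
```

Under these assumptions the first call reproduces the order in the question (3, 4, 1, 2): doc 4 is so short that length normalization outweighs its larger slop. The second call gives the order in the response above (3, 1, 4, 2), and the third puts docs 1 and 3 first with equal scores, followed by doc 2 and then doc 4.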