ElasticSearch: find if all document words are in the query term

I have a query that, given a term of one or more words, searches all documents whose `name` text field contains every token of the term and has at most 2 more words than the term.

tokens = term.split()
n_tokens = len(tokens)

es_client.search(index='my_index_name',
                 body={
                     "size": 1000,
                     "query": {
                         "bool": {
                             "must": [
                                 # the document may have at most 2 more tokens
                                 # than the query term
                                 {
                                     "range": {
                                         "n_tokens": {"gte": n_tokens, "lt": n_tokens + 3}
                                     }
                                 },
                                 # every query token must match the name field
                                 {
                                     "bool": {
                                         "should": [
                                             {"term": {"name": token}}
                                             for token in tokens
                                         ],
                                         "minimum_should_match": n_tokens
                                     }
                                 },
                             ]
                         }
                     }
                 })

If the indexed documents are:

[{'name': 'big apple', 'n_tokens': 2},
 {'name': 'big red apple', 'n_tokens': 3},
 {'name': 'red small apple', 'n_tokens': 3},
 {'name': 'my very tasty red apple', 'n_tokens': 5}]
then the query term `red apple` would return only the second and third documents.
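
For reproducibility, here is a minimal sketch that creates such an index, loads the sample documents, and runs the query above (a sketch assuming a local `elasticsearch-py` client; the `name.keyword` subfield is included because the script solution below reads it):

from elasticsearch import Elasticsearch

es_client = Elasticsearch()

# text field for per-token term matching, keyword subfield for scripts,
# integer field for the length filter
es_client.indices.create(index='my_index_name', body={
    "mappings": {
        "properties": {
            "name": {"type": "text",
                     "fields": {"keyword": {"type": "keyword"}}},
            "n_tokens": {"type": "integer"}
        }
    }
})

for doc in [{'name': 'big apple', 'n_tokens': 2},
            {'name': 'big red apple', 'n_tokens': 3},
            {'name': 'red small apple', 'n_tokens': 3},
            {'name': 'my very tasty red apple', 'n_tokens': 5}]:
    es_client.index(index='my_index_name', body=doc)
es_client.indices.refresh(index='my_index_name')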

I want the same functionality, but reversed: basically, check whether all of an indexed document's `name` tokens are contained in a given list of query tokens.

If the indexed documents are:

[{'name': 'big apple', 'n_tokens': 2},
 {'name': 'big red apple', 'n_tokens': 3},
 {'name': 'big red orange', 'n_tokens': 3}]
then the query tokens `['my', 'tasty', 'big', 'red', 'apple']` would return only the first and second documents.
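
To spell out the intended semantics, the check is equivalent to this plain-Python predicate (illustrative only; the function name is hypothetical):

def all_name_tokens_in_query(name, query_tokens):
    # True when every whitespace-separated token of the document's
    # name also appears in the query token list
    return set(name.split()) <= set(query_tokens)

query = ['my', 'tasty', 'big', 'red', 'apple']
assert all_name_tokens_in_query('big apple', query)           # 1st doc: matches
assert all_name_tokens_in_query('big red apple', query)       # 2nd doc: matches
assert not all_name_tokens_in_query('big red orange', query)  # 'orange' missing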

Is it possible to implement this?

This can only be done with a script.

In the script, we first split the text on whitespace and then check whether all of the resulting tokens are present in the given search array:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "my tasty big red apple"
          }
        },
        {
          "script": {
            "script": {
              "source": """
                        def tokens=[];
                         String target = " ";
                         String someString = doc['name.keyword'].value;
                         StringTokenizer tokenValue = new StringTokenizer(someString, target);

                         while (tokenValue.hasMoreTokens()) {
                           tokens.add(tokenValue.nextToken());
                         }
                         
                         if(params.query_tokens.containsAll(tokens)){return true;}
                     """,
              "params": {
                "query_tokens": [
                  "my",
                  "tasty",
                  "big",
                  "red",
                  "apple"
                ]
              }
            }
          }
        }
      ]
    }
  }
}

The result:

"hits" : [
      {
        "_index" : "index14",
        "_type" : "_doc",
        "_id" : "trn-m3sBKGIYIG8qZHBE",
        "_score" : 2.1927156,
        "_source" : {
          "name" : "big red apple",
          "n_tokens" : 3
        }
      },
      {
        "_index" : "index14",
        "_type" : "_doc",
        "_id" : "tbn-m3sBKGIYIG8qX3D-",
        "_score" : 1.9476067,
        "_source" : {
          "name" : "big apple",
          "n_tokens" : 2
        }
      }
    ]
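
For reference, a sketch of issuing the same script query from `elasticsearch-py`, with `query_tokens` built from a Python list (assumes the `es_client` from above; the Painless source is unchanged):

query_tokens = ['my', 'tasty', 'big', 'red', 'apple']

painless = """
    def tokens = [];
    StringTokenizer tokenValue =
        new StringTokenizer(doc['name.keyword'].value, " ");
    while (tokenValue.hasMoreTokens()) {
      tokens.add(tokenValue.nextToken());
    }
    return params.query_tokens.containsAll(tokens);
"""

resp = es_client.search(index='my_index_name', body={
    "query": {
        "bool": {
            "must": [
                # match narrows the candidates; the script does the exact check
                {"match": {"name": " ".join(query_tokens)}},
                {"script": {"script": {
                    "source": painless,
                    "params": {"query_tokens": query_tokens}
                }}}
            ]
        }
    }
})
for hit in resp['hits']['hits']:
    print(hit['_source']['name'])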

Script queries are slow and do not scale well.

You can instead do it literally: for each possible value of `n_tokens`, generate a clause that requires `minimum_should_match` to equal that same `n_tokens`:

{
    "size": 1000,
    "query": {
        "bool": {
            "should": [
                {
                    "bool": {
                        "must": {
                            "term": {
                                "n_tokens": n_token + 1
                            }
                        },
                        "should": [
                            {"term": {"name": token}}
                            for token in tokens
                        ],
                        "minimum_should_match": n_token + 1
                    }
                } for n_token in range(len(tokens))
            ],
            "minimum_should_match": 1
        }
    }
}
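
The trick: a document with `n_tokens == k` can satisfy at most `k` of the per-token `term` clauses, so pairing `"term": {"n_tokens": k}` with `minimum_should_match: k` forces every one of its tokens to be among the query tokens (this assumes each document's tokens are unique). A runnable sketch wrapping the body in a helper (the function name is hypothetical; same client and index as above):

def build_subset_query(tokens, size=1000):
    # one inner bool per possible document length k = 1..len(tokens):
    # a doc with n_tokens == k must match at least k of the token terms
    return {
        "size": size,
        "query": {
            "bool": {
                "should": [
                    {
                        "bool": {
                            "must": {"term": {"n_tokens": k}},
                            "should": [{"term": {"name": token}}
                                       for token in tokens],
                            "minimum_should_match": k
                        }
                    }
                    for k in range(1, len(tokens) + 1)
                ],
                "minimum_should_match": 1
            }
        }
    }

resp = es_client.search(index='my_index_name',
                        body=build_subset_query(['my', 'tasty', 'big', 'red', 'apple']))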