Elasticsearch: find whether all document words are in the query term
I have a query that, given a term of one or more words, searches all documents whose `name` text field contains all of the term's tokens and whose word count exceeds the term's by no more than 2 words.
tokens = term.split()
n_tokens = len(tokens)
es_client.search(index='my_index_name',
                 body={
                     "size": 1000,
                     "query": {
                         "bool": {
                             "must": [
                                 {
                                     "range": {
                                         "n_tokens": {"gte": n_tokens, "lt": n_tokens + 3}
                                     }
                                 },
                                 {
                                     "bool": {
                                         "should": [
                                             {"term": {"name": token}}
                                             for token in tokens
                                         ],
                                         "minimum_should_match": n_tokens
                                     }
                                 },
                             ]
                         }
                     }
                 })
If the indexed documents are:
[{'name': 'big apple', 'n_tokens': 2},
{'name': 'big red apple', 'n_tokens': 3},
{'name': 'red small apple', 'n_tokens': 3},
{'name': 'my very tasty red apple', 'n_tokens': 5}]
Given the query term `red apple`, it would return only the second and third documents.
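In plain Python terms, the filtering behavior of this query can be sketched as follows (an illustration of the semantics only, not how Elasticsearch evaluates it):

```python
# Plain-Python sketch of the query semantics: the document must contain
# every query token, and its token count may exceed the query's by at most 2.
def matches_forward(doc_tokens, query_tokens):
    n = len(query_tokens)
    return n <= len(doc_tokens) < n + 3 and set(query_tokens) <= set(doc_tokens)

docs = ['big apple', 'big red apple', 'red small apple', 'my very tasty red apple']
hits = [d for d in docs if matches_forward(d.split(), 'red apple'.split())]
# hits == ['big red apple', 'red small apple']
```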
I want the same functionality, but reversed: check that all of an indexed document's tokens are present in a given list of tokens.
If the indexed documents are:
[{'name': 'big apple', 'n_tokens': 2},
{'name': 'big red apple', 'n_tokens': 3},
{'name': 'big red orange', 'n_tokens': 3}]
Given the query tokens ['my', 'tasty', 'big', 'red', 'apple'], it would return only the first and second documents.
Is it possible to achieve this?
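In plain Python, the reversed check I'm after would be:

```python
# Reversed semantics: every token of the indexed document
# must appear in the given query token list.
def matches_reverse(doc_tokens, query_tokens):
    return set(doc_tokens) <= set(query_tokens)

docs = ['big apple', 'big red apple', 'big red orange']
query = ['my', 'tasty', 'big', 'red', 'apple']
hits = [d for d in docs if matches_reverse(d.split(), query)]
# hits == ['big apple', 'big red apple']
```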
This can only be done with a script.
In the script we first split the text on whitespace, then check that all of the resulting tokens are present in the given search array:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "my tasty big red apple"
          }
        },
        {
          "script": {
            "script": {
              "source": """
                def tokens = [];
                String someString = doc['name.keyword'].value;
                StringTokenizer tokenValue = new StringTokenizer(someString, " ");
                while (tokenValue.hasMoreTokens()) {
                  tokens.add(tokenValue.nextToken());
                }
                return params.query_tokens.containsAll(tokens);
              """,
              "params": {
                "query_tokens": [
                  "my",
                  "tasty",
                  "big",
                  "red",
                  "apple"
                ]
              }
            }
          }
        }
      ]
    }
  }
}
Result:
"hits" : [
  {
    "_index" : "index14",
    "_type" : "_doc",
    "_id" : "trn-m3sBKGIYIG8qZHBE",
    "_score" : 2.1927156,
    "_source" : {
      "name" : "big red apple",
      "n_tokens" : 3
    }
  },
  {
    "_index" : "index14",
    "_type" : "_doc",
    "_id" : "tbn-m3sBKGIYIG8qX3D-",
    "_score" : 1.9476067,
    "_source" : {
      "name" : "big apple",
      "n_tokens" : 2
    }
  }
]
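For reference, the request body above can be assembled from an arbitrary token list in Python. The helper below is a sketch: the field names `name`/`name.keyword`, the index name, and `es_client` all follow the question's setup.

```python
def build_script_query(query_tokens):
    # Painless source: split the stored keyword value on spaces and check
    # that every resulting token occurs in params.query_tokens.
    source = (
        "def tokens = []; "
        "StringTokenizer t = new StringTokenizer(doc['name.keyword'].value, ' '); "
        "while (t.hasMoreTokens()) { tokens.add(t.nextToken()); } "
        "return params.query_tokens.containsAll(tokens);"
    )
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"name": " ".join(query_tokens)}},
                    {"script": {"script": {"source": source,
                                           "params": {"query_tokens": query_tokens}}}},
                ]
            }
        }
    }

body = build_script_query(['my', 'tasty', 'big', 'red', 'apple'])
# body can then be passed to es_client.search(index='my_index_name', body=body)
```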
Scripts are slow and don't scale well.
You can do this without a script: literally generate one clause per possible document length, each with a minimum_should_match equal to that n_tokens value.
{
    "size": 1000,
    "query": {
        "bool": {
            "should": [
                {
                    "bool": {
                        "must": {
                            "term": {
                                "n_tokens": n_token + 1
                            }
                        },
                        "should": [
                            {"term": {"name": token}}
                            for token in tokens
                        ],
                        "minimum_should_match": n_token + 1
                    }
                } for n_token in range(len(tokens))
            ],
            "minimum_should_match": 1
        }
    }
}
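Wrapped as a function (a sketch; `es_client` and the index name are taken from the question), the query can be generated and sent like this:

```python
def build_exact_length_query(tokens, size=1000):
    # One sub-clause per possible document length 1..len(tokens): a document
    # matches clause i iff it has exactly i+1 tokens and all of them
    # (i.e. at least i+1 of the term clauses) are among the query tokens.
    return {
        "size": size,
        "query": {
            "bool": {
                "should": [
                    {
                        "bool": {
                            "must": {"term": {"n_tokens": n_token + 1}},
                            "should": [{"term": {"name": token}} for token in tokens],
                            "minimum_should_match": n_token + 1,
                        }
                    }
                    for n_token in range(len(tokens))
                ],
                "minimum_should_match": 1,
            }
        },
    }

query = build_exact_length_query(['my', 'tasty', 'big', 'red', 'apple'])
# query can then be passed to es_client.search(index='my_index_name', body=query)
```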