How to search by non-tokenized field length in ElasticSearch

Question

Suppose I create an index people that takes entries with two properties: name and friends

PUT /people
{
  "mappings": {
    "properties": {
      "friends": { 
        "type": "text",
        "fields": {
          "keyword": { 
            "type": "keyword"
          }
        }
      }
    }
  }
}

I index two entries, each of which has two friends.

POST /people/_doc
{
  "name": "Jack",
  "friends": [
    "Jill", "John"
  ]
}


POST /people/_doc
{
  "name": "Max",
  "friends": [
    "John", "John"  # Max will have two friends, but both named John
  ]
}

Now I want to search for people who have more than one friend

GET /people/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "script": {
            "script": {
              "source": "doc['friends.keyword'].length > 1"
            }
          }
        }
      ]
    }
  }
}

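This returns only Jack and skips Max. I assume this is because we are actually traversing the inverted index, and John and John produce only one token, i.e. 'john', so the number of tokens is actually 1.

Since my index is relatively small and performance is not the priority, I would like to traverse the source instead of the inverted index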

GET /people/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "script": {
            "script": {
              "source": "ctx._source.friends.length > 1"
            }
          }
        }
      ]
    }
  }
}

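But according to https://github.com/elastic/elasticsearch/issues/20068, the source is supported only on update, not on search, so I can't.

One obvious solution seems to be to take the length of the field and store it in the index, something like friends_count: 2, and then filter on that. But that requires reindexing, and this seems like something that should be solvable in some obvious way that I am missing.

Thanks a lot.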

Answer 1

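ES 7.11 introduced a new feature called runtime fields. A runtime field is a field that is evaluated at query time. Runtime fields enable you to:

Add fields to existing documents without reindexing your data
Start working with your data without understanding how it is structured
Override the value returned from an indexed field at query time
Define fields for a specific use without modifying the underlying schema

You can find more information about runtime fields in the Elasticsearch documentation. As for how to use runtime fields, you can do something like this:

Index time: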

PUT my-index/
{
  "mappings": {
    "runtime": {
      "friends_count": {
        "type": "long",
        "script": {
          "source": "emit(params._source['friends'].size())"
        }
      }
    },
    "properties": {
      "@timestamp": {"type": "date"}
    }
  }
}
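
If the index already exists, as the people index in the question does, the runtime field can probably be added without recreating the index by using the update mapping API. A minimal sketch, assuming Elasticsearch 7.11 or later and assuming every document stores friends as an array in its source; the body below would be sent as PUT people/_mapping:

```json
{
  "runtime": {
    "friends_count": {
      "type": "long",
      "script": {
        "source": "emit(params._source['friends'].size())"
      }
    }
  }
}
```

Because the script reads the original array from the source rather than from doc values, duplicate entries such as "John", "John" should be counted twice.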

You can also define runtime fields at search time; see the Elasticsearch documentation for more details.

Search time:

GET my-index/_search
{
  "runtime_mappings": {
    "friends_count": {
      "type": "long",
      "script": {
        "source": "emit(params._source['friends'].size())"
      }
    }
  }
}
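
To filter on a search-time runtime field, the request also needs a query that references it. A minimal sketch, assuming every document stores friends as an array in its source and that the field is named friends_count as above; the body below would be sent as GET people/_search:

```json
{
  "runtime_mappings": {
    "friends_count": {
      "type": "long",
      "script": {
        "source": "emit(params._source['friends'].size())"
      }
    }
  },
  "query": {
    "bool": {
      "filter": [
        { "range": { "friends_count": { "gt": 1 } } }
      ]
    }
  }
}
```

With the two documents from the question, this should match both Jack and Max, since the count is taken from the source array and not from the deduplicated keyword values. Loading the source for every document is slow on large indices, which the question says is acceptable here.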

Update:

POST mytest/_update_by_query
{
    "query": {
        "match_all": {}
    }, 
    "script": {
       "source": "ctx._source.arrayLength = ctx._source.friends.size()"
    }
}

You can use the request above to update all of your documents and then adjust your query accordingly.
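
As a sketch of what the adjusted query might look like once the update has run: this assumes the new arrayLength field was dynamically mapped as a numeric type, and the body would be sent to the _search endpoint of the index that was updated:

```json
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "arrayLength": { "gt": 1 } } }
      ]
    }
  }
}
```

Documents indexed after the update will not have arrayLength unless the update is run again or the value is set when the document is indexed.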

Answer 2

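For anyone wondering about the same problem: I think @Kaveh's answer is the most promising approach, but I could not get it to work in my case. It seems to me that the source is created after the query has been executed, so you cannot access the source for the purpose of filtering queries.

You have two options:

Filter the results at the application level (an ugly and slow solution)
Actually store the field length in a separate field, such as friends_count

There may be another option that I am not aware of.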
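
For the second option, one way to keep friends_count in sync without changing the application might be an ingest pipeline with a script processor. A minimal sketch, where the pipeline name friends_count_pipeline is hypothetical and the script assumes friends is either an array, a single value, or absent; the body below would be sent as PUT _ingest/pipeline/friends_count_pipeline:

```json
{
  "description": "Store the number of entries in friends as friends_count",
  "processors": [
    {
      "script": {
        "source": "ctx.friends_count = ctx.friends instanceof List ? ctx.friends.size() : (ctx.friends == null ? 0 : 1)"
      }
    }
  ]
}
```

Documents could then be indexed with POST /people/_doc?pipeline=friends_count_pipeline, and the search would use a plain range filter on friends_count. Documents that already exist would still need to be reindexed or updated once.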

elasticsearch

elastic-stack