Search with filter by token count

Question

Fields in a document are analyzed to produce tokens.

{"message":"hello world"} -> tokens: ["hello", "world"]
{"message":"hello"} -> tokens: ["hello"]
{"message":"world"} -> tokens: ["world"]
{"message":"hello java"} -> tokens: ["hello", "java"]
{"message":"java"} -> tokens: ["java"]

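Is it possible to search for all documents in which a specific field contains a given token plus one or more other tokens?

For the examples above (numbered in the order listed), the result for the token "hello" would be documents:
- 1,4
and for "world":
- 1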

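As described in termvectors, it is possible to access the tokens or statistics about them. That only works for a specific document, though, not as a search filter in a query or an aggregation.
It would be great if someone could help.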

Answer 1

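Yes, you can use the token_count type for this. For example, in your mapping you can define message as a multi-field that contains the message itself (i.e. "hello", "hello world", etc.) as well as the number of tokens in the message. You can then add a constraint on the word count to your query.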

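So the mapping for message should look like this (this is pre-5.0 Elasticsearch syntax; later versions replace the string type with text):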

# message is a normal analyzed string; word_count is a sub-field that holds its token count
curl -XPUT localhost:9200/tests -d '
{
  "mappings": {
    "test": {
      "properties": {
        "message": {
          "type": "string",
          "fields": {
            "word_count": {
              "type": "token_count",
              "store": "yes",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  }
}'
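
To make the effect of the sub-field concrete, here is a minimal Python sketch of the value that message.word_count would hold for the five sample documents from the question. It does not talk to Elasticsearch: the tokenize helper is a hypothetical stand-in that lowercases and splits on non-word characters, which should agree with the standard analyzer for these simple ASCII messages but is not the real analyzer.

```python
import re

# Sample documents from the question, keyed by their position in the list.
DOCS = {
    1: "hello world",
    2: "hello",
    3: "world",
    4: "hello java",
    5: "java",
}


def tokenize(text):
    """Rough stand-in for the standard analyzer (assumption, not the real one)."""
    return re.findall(r"\w+", text.lower())


def word_counts(docs):
    """Return the token count each document's word_count sub-field would hold."""
    return {doc_id: len(tokenize(message)) for doc_id, message in docs.items()}


if __name__ == "__main__":
    for doc_id, count in word_counts(DOCS).items():
        print(doc_id, repr(DOCS[doc_id]), count)
```

Only documents 1 and 4 end up with a count above 1, and that is the property the range clause in the query relies on.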

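You can then query for all documents that contain hello in message, restricted to those whose message has more than one token. With the following query you will get hello java and hello world, but not hello: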

curl -XPOST localhost:9200/tests/test/_search -d '
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "message": "hello"
          }
        },
        {
          "range": {
            "message.word_count": {
              "gt": 1
            }
          }
        }
      ]
    }
  }
}'

Similarly, if you replace hello with world in the above query, you will only get hello world.
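
The query can also be built programmatically and its claimed results sanity-checked without a cluster. The sketch below is only an illustration: build_query and emulate are hypothetical helper names, emulate evaluates the two must clauses in memory with a simple regex tokenizer (an assumption that only approximates the standard analyzer), and it handles single-token searches only.

```python
import json
import re

# Sample documents from the question, keyed by their position in the list.
DOCS = {1: "hello world", 2: "hello", 3: "world", 4: "hello java", 5: "java"}


def build_query(token, more_than=1):
    """Build the same bool query as above for an arbitrary token."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"message": token}},
                    {"range": {"message.word_count": {"gt": more_than}}},
                ]
            }
        }
    }


def emulate(query, docs):
    """Evaluate both must clauses locally; this is not an Elasticsearch client."""
    match_clause, range_clause = query["query"]["bool"]["must"]
    token = match_clause["match"]["message"].lower()
    gt = range_clause["range"]["message.word_count"]["gt"]
    hits = []
    for doc_id, message in sorted(docs.items()):
        tokens = re.findall(r"\w+", message.lower())
        # A document matches if it contains the token and has more than gt tokens.
        if token in tokens and len(tokens) > gt:
            hits.append(doc_id)
    return hits


if __name__ == "__main__":
    print(json.dumps(build_query("hello"), indent=2))  # body to pass to curl -d
    print(emulate(build_query("hello"), DOCS))
    print(emulate(build_query("world"), DOCS))
```

Note that the range clause counts tokens rather than distinct tokens, so a message such as "hello hello" would also pass the filter.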

Tags: token, elasticsearch