Elasticsearch

Question

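I have a temporary index containing documents that need to be reviewed. I would like to group these documents by the words they contain.

For example, I have these documents: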

1 - "aaa bbb ccc ddd eee fff"

2 - "bbb mmm aaa fff xxx"

3 - "hhh aaa fff"

So I would like to get the most popular words, ideally with counts: "aaa" - 3, "fff" - 3, "bbb" - 2, and so on.

Is this possible with Elasticsearch?

Answer 1

A simple terms aggregation search will do what you need:

(where mydata is the name of your field)

curl -XGET 'http://localhost:9200/test/data/_search?search_type=count&pretty' -d '{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "mydata_agg": {
      "terms": {"field": "mydata"}
    }
  }
}'

This will return:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "mydata_agg" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "aaa",
        "doc_count" : 3
      }, {
        "key" : "fff",
        "doc_count" : 3
      }, {
        "key" : "bbb",
        "doc_count" : 2
      }, {
        "key" : "ccc",
        "doc_count" : 1
      }, {
        "key" : "ddd",
        "doc_count" : 1
      }, {
        "key" : "eee",
        "doc_count" : 1
      }, {
        "key" : "hhh",
        "doc_count" : 1
      }, {
        "key" : "mmm",
        "doc_count" : 1
      }, {
        "key" : "xxx",
        "doc_count" : 1
      } ]
    }
  }
}
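
To make the meaning of doc_count concrete, here is a minimal Python sketch that reproduces the same numbers locally for the three documents from the question. It is only an illustration, not how Elasticsearch computes the aggregation: it assumes plain whitespace tokenization (the real result depends on the analyzer configured for the field), and the names document_frequencies and top_words are made up for this example.

```python
from collections import Counter

# The three example documents from the question.
docs = [
    "aaa bbb ccc ddd eee fff",
    "bbb mmm aaa fff xxx",
    "hhh aaa fff",
]


def document_frequencies(texts):
    """Count, for each word, how many documents contain it at least once."""
    counts = Counter()
    for text in texts:
        # set() so a word repeated inside one document is counted once,
        # mirroring a count of documents rather than of occurrences.
        # Assumption: whitespace splitting stands in for the field's analyzer.
        counts.update(set(text.split()))
    return counts


def top_words(texts, size=10):
    """Return (word, doc_count) pairs, highest count first, ties by word."""
    counts = document_frequencies(texts)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:size]


if __name__ == "__main__":
    for word, count in top_words(docs):
        print(word, count)
```

Each word is counted once per document, so a word that appears twice in the same document would still contribute 1 to its count.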

Answer 2

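Probably because this question and the accepted answer are both a few years old, but there is a better way to do this now.

The accepted answer does not account for the fact that the most common words are usually uninteresting ones, namely stop words such as "the", "a", "in", "for", and so on.

This is usually the case for fields that hold data of type text rather than keyword.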

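That is why Elasticsearch actually has an aggregation dedicated to this purpose, called the Significant Text aggregation.
From the documentation:

It is specifically designed for use on fields of type text
It does not require field data or doc values
It re-analyzes text content on the fly, which means it can also filter duplicate sections of noisy text that would otherwise tend to skew the statistics.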

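However, it can take longer than other kinds of queries, so it is recommended to use it after filtering the data with a query.match, or inside a preceding sampler-type aggregation.

So, in your case, you would send a query like this (omitting filtering/sampling):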

{
    "aggs": {
        "keywords": {
            "significant_text": {
                "field": "myfield"
            }
        }
    }
}
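
For completeness, here is a hedged Python sketch of what the same request might look like with the filtering and sampling put back in: a match query narrows the documents, a sampler aggregation limits how many are re-analyzed, and significant_text runs inside it. The function name, the aggregation name "sample", the search text "aaa" and the shard_size of 200 are all placeholders I picked for illustration; only "myfield" and "keywords" come from the query above. As far as I understand, significance is scored against a background set, so the query is also what defines which documents count as the foreground.

```python
import json


def build_significant_text_request(field, match_text, shard_size=200):
    """Build a search body: match query -> sampler -> significant_text."""
    return {
        "size": 0,  # return aggregation results only, no hits
        # Filter step: only documents matching this text are considered.
        "query": {"match": {field: match_text}},
        "aggs": {
            # Sampling step: cap the number of top-scoring docs per shard
            # that the nested aggregation has to re-analyze.
            "sample": {
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "keywords": {"significant_text": {"field": field}}
                },
            }
        },
    }


if __name__ == "__main__":
    # "myfield" is the field name used in the answer; "aaa" is a placeholder.
    body = build_significant_text_request("myfield", "aaa")
    print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request in the same way as the curl example in the first answer.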

Elasticsearch - How to get popular words list of documents