Elasticsearch aggregation buckets count an email address as two different bucket keys

Question

I have stored the field value as "user1@user.com".

I am using this aggregation JSON query:

"aggregations": {
                "email-terms": {
                    "terms": {
                        "field": "l_obj.email",
                        "size": 0,
                        "shard_size": 0,
                        "order": {
                            "_count": "desc"
                        }
                    }
                }
            }


I am getting this response:

"buckets" : [
{
"key" : "user.com",
"doc_count" : 1
},
{
"key" : "user1",
"doc_count" : 1
}
]

instead of:

"buckets" : [
{
"key" : "user1@user.com",
"doc_count" : 1
}
]

The same problem occurs with string values such as user1.user2.user.com when I run a terms aggregation. Am I missing something here?

Answer 1

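You need to set "index": "not_analyzed" on the "email" field in your mapping.

If I set up a toy index without specifying an analyzer (and without telling the field not to use one), the standard analyzer is used, which splits on whitespace and on symbols such as "@". So, with this index definition: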

PUT /test_index
{
   "mappings": {
      "doc": {
         "properties": {
            "email": {
               "type": "string"
            }
         }
      }
   }
}

if I add a document:

PUT /test_index/doc/1
{
    "email": "user1@user.com"
}

and then request a terms aggregation, I get two terms back:

POST /test_index/_search?search_type=count
{
   "aggregations": {
      "email-terms": {
         "terms": {
            "field": "email"
         }
      }
   }
}
...
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "email-terms": {
         "buckets": [
            {
               "key": "user.com",
               "doc_count": 1
            },
            {
               "key": "user1",
               "doc_count": 1
            }
         ]
      }
   }
}

But if I rebuild the index with "index": "not_analyzed" on that field, and index the same document again:

DELETE /test_index

PUT /test_index
{
   "mappings": {
      "doc": {
         "properties": {
            "email": {
               "type": "string",
               "index": "not_analyzed"
            }
         }
      }
   }
}

PUT /test_index/doc/1
{
    "email": "user1@user.com"
}

and run the same terms aggregation, I get back only one term for that email address:

POST /test_index/_search?search_type=count
{
   "aggregations": {
      "email-terms": {
         "terms": {
            "field": "email"
         }
      }
   }
}
...
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "email-terms": {
         "buckets": [
            {
               "key": "user1@user.com",
               "doc_count": 1
            }
         ]
      }
   }
}

Here is the code I used, all together:

http://sense.qbox.io/gist/a73a28bf7450b637138b02a371fb15cabf344ab6
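
To make the difference concrete, here is a minimal Python sketch of how a terms aggregation ends up with different buckets for analyzed and not_analyzed values. It is only an illustration under stated assumptions: `approx_standard_tokens` is a hypothetical helper that lowercases and splits on whitespace and "@", which is a crude stand-in for the standard analyzer and not its real tokenization rules, and `terms_buckets` is likewise a made-up function, not an Elasticsearch API.

```python
import re
from collections import Counter


def approx_standard_tokens(text):
    """Crude stand-in for the standard analyzer (an assumption, not the real
    rules): lowercase, then split on whitespace and '@'."""
    return [t for t in re.split(r"[\s@]+", text.lower()) if t]


def not_analyzed_tokens(text):
    """A not_analyzed field indexes the whole value as a single term."""
    return [text]


def terms_buckets(values, tokenizer):
    """Count how many documents contain each term, ordered by count
    descending, with ties broken by term."""
    counts = Counter()
    for value in values:
        # Each document contributes at most once to a given term's doc_count.
        counts.update(set(tokenizer(value)))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": term, "doc_count": n} for term, n in ordered]


docs = ["user1@user.com"]
analyzed = terms_buckets(docs, approx_standard_tokens)
exact = terms_buckets(docs, not_analyzed_tokens)
print(analyzed)  # two buckets: "user.com" and "user1"
print(exact)     # one bucket: "user1@user.com"
```

The aggregation itself behaves the same way in both cases; what changes is the set of terms that were written to the index for the field, which is why the fix belongs in the mapping rather than in the query.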

Answer 2

We can use an index template to predefine field types (http://www.elastic.co/guide/en/elasticsearch/reference/1.3/indices-templates.html), for example:

Using a REST client or Elasticsearch Sense:

PUT/POST http://escluster:port/_template

{
  "testtemplate": {
    "aliases": {},
    "mappings": {
      "test1": {
        "_all": {
          "enabled": false
        },
        "_source": {
          "enabled": true
        },
        "properties": {
          "email": {
            "fielddata": {
              "format": "doc_values"
            },
            "index": "not_analyzed",
            "type": "string"
          }...
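
As a rough sketch of what a small, complete template body could look like, the Python snippet below builds one as a dictionary and prints it as JSON. The index pattern `test*`, the helper name `build_email_template`, and its parameters are assumptions for illustration; the truncated listing above may contain further fields that are not reproduced here. In Elasticsearch 1.x a body of this shape would be sent with PUT to /_template/<template name>.

```python
import json


def build_email_template(index_pattern, doc_type):
    """Build an index-template body that maps `email` as a not_analyzed string.

    index_pattern and doc_type are illustrative parameters, not taken from
    the answer above.
    """
    return {
        # Indices whose names match this pattern pick up the mapping on creation.
        "template": index_pattern,
        "mappings": {
            doc_type: {
                "properties": {
                    "email": {
                        "type": "string",
                        "index": "not_analyzed",
                    }
                }
            }
        },
    }


body = build_email_template("test*", "test1")
print(json.dumps(body, indent=2))
```

A template only affects indices created after it is registered, so an existing index with an analyzed "email" field would still need to be recreated and reindexed.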


elasticsearch