Fields not getting sorted in alphabetical order in Elasticsearch

Question

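I have a few documents with a name field. I am using the analyzed version of the name field for search and the not_analyzed version for sorting. Sorting works at one level: the names are first ordered alphabetically by their initial letter. But within the list for each letter, the names are sorted lexicographically rather than alphabetically. Here is the mapping I am using: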

{
  "mappings": {
    "seing": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

Can anyone provide a solution for this?

Answer 1

Digging into the Elasticsearch documentation, I stumbled upon this:

Sorting and Collations

Case-Insensitive Sorting

Imagine that we have three user documents whose name fields contain Boffey, BROWN, and bailey, respectively. First we will apply the technique described in String Sorting and Multifields of using a not_analyzed field for sorting:

PUT /my_index
{
  "mappings": {
    "user": {
      "properties": {
        "name": {                    //1
          "type": "string",
          "fields": {
            "raw": {                 //2
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

1. The analyzed name field is used for search.
2. The not_analyzed name.raw field is used for sorting.

The preceding search request would return the documents in this order: BROWN, Boffey, bailey. This is known as lexicographical order as opposed to alphabetical order. Essentially, the bytes used to represent capital letters have a lower value than the bytes used to represent lowercase letters, and so the names are sorted with the lowest bytes first.

That may make sense to a computer, but doesn’t make much sense to human beings who would reasonably expect these names to be sorted alphabetically, regardless of case. To achieve this, we need to index each name in a way that the byte ordering corresponds to the sort order that we want.

In other words, we need an analyzer that will emit a single lowercase token:

Following this logic, instead of storing the raw string, you need to lowercase it using a custom keyword analyzer:

PUT /my_index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "case_insensitive_sort" : {
          "tokenizer" : "keyword",
          "filter" : ["lowercase"]
        }
      }
    }
  },
  "mappings" : {
    "seing" : {
      "properties" : {
        "name" : {
          "type" : "string",
          "fields" : {
            "raw" : {
              "type" : "string",
              "analyzer" : "case_insensitive_sort"
            }
          }
        }
      }
    }
  }
}

Sorting by name.raw should now be alphabetical rather than lexicographical.
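The difference between the two orderings can be reproduced outside Elasticsearch. The following is a minimal sketch in plain JavaScript, not Elasticsearch code: it uses the three names from the quoted documentation, and sortKey is a hypothetical helper that stands in for the keyword tokenizer plus lowercase filter.

```javascript
// Names from the quoted documentation example.
const names = ["Boffey", "BROWN", "bailey"];

// The default sort compares UTF-16 code units, so uppercase letters
// (A-Z, 65-90) come before lowercase letters (a-z, 97-122).
const lexicographic = [...names].sort();

// Hypothetical helper: mimics what the keyword tokenizer plus the
// lowercase filter produce, i.e. one lowercased token per value.
function sortKey(value) {
  return value.toLowerCase();
}

// Sort by the lowercased key while keeping the original strings.
const caseInsensitive = [...names].sort((a, b) => {
  const ka = sortKey(a);
  const kb = sortKey(b);
  return ka < kb ? -1 : ka > kb ? 1 : 0;
});

console.log(lexicographic);   // [ 'BROWN', 'Boffey', 'bailey' ]
console.log(caseInsensitive); // [ 'bailey', 'Boffey', 'BROWN' ]
```

The original values are untouched; only the key used for comparison is lowercased, which is the same split Elasticsearch makes between _source and the indexed sort value.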

Here is a quick test I ran on my local machine using Marvel:

Index structure:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive_sort": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            },
            "keyword": {
              "type": "string",
              "analyzer": "case_insensitive_sort"
            }
          }
        }
      }
    }
  }
}

Test data:

PUT /my_index/user/1
{
  "name": "Tim"
}

PUT /my_index/user/2
{
  "name": "TOM"
}

Query sorted on the raw field:

POST /my_index/user/_search
{
  "sort": "name.raw"
}

Result:

{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "2",
  "_score" : null,
  "_source" : {
    "name" : "TOM"
  },
  "sort" : [
    "TOM"
  ]
},
{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "1",
  "_score" : null,
  "_source" : {
    "name" : "Tim"
  },
  "sort" : [
    "Tim"
  ]
}

Query sorted on the lowercased field:

POST /my_index/user/_search
{
  "sort": "name.keyword"
}

Result:

{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "1",
  "_score" : null,
  "_source" : {
    "name" : "Tim"
  },
  "sort" : [
    "tim"
  ]
},
{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "2",
  "_score" : null,
  "_source" : {
    "name" : "TOM"
  },
  "sort" : [
    "tom"
  ]
}

I suspect the second result is the correct one in your case.

Answer 2

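As of Elasticsearch 5.2, you can use a normalizer to set up case-insensitive sorting.

The normalizer property of keyword fields is similar to analyzer, except that it guarantees that the analysis chain produces a single token.

The normalizer is applied before the keyword is indexed, as well as at search time when the keyword field is searched via a query parser such as the match query.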

PUT index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

PUT index/type/1
{
  "foo": "BÀR"
}

PUT index/type/2
{
  "foo": "bar"
}

PUT index/type/3
{
  "foo": "baz"
}

POST index/_refresh

GET index/_search
{
  "query": {
    "match": {
      "foo": "BAR"
    }
  }
}

The above query matches documents 1 and 2, since BÀR is converted to bar at both index and query time.

{
  "took": $body.took,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "index",
        "_type": "type",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "foo": "bar"
        }
      },
      {
        "_index": "index",
        "_type": "type",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "foo": "BÀR"
        }
      }
    ]
  }
}
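To see why documents 1 and 2 both match, the normalization step can be approximated in plain JavaScript. This is only a sketch: normalizeKeyword is a hypothetical helper, and stripping combining marks after Unicode decomposition is an assumption that approximates the asciifolding filter rather than reproducing it exactly.

```javascript
// Hypothetical stand-in for my_normalizer: lowercase + ASCII folding.
// Assumption: removing combining marks after NFD decomposition only
// approximates what Elasticsearch's asciifolding filter does.
function normalizeKeyword(value) {
  return value
    .normalize("NFD")                 // split "À" into "A" + combining accent
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining accents
    .toLowerCase();
}

// The three documents indexed above, keyed by id.
const indexed = { 1: "BÀR", 2: "bar", 3: "baz" };

// The query term goes through the same normalization as the indexed values.
const queryTerm = normalizeKeyword("BAR");

// Ids of documents whose normalized value equals the normalized query.
const matches = Object.keys(indexed).filter(
  (id) => normalizeKeyword(indexed[id]) === queryTerm
);

console.log(queryTerm); // bar
console.log(matches);   // [ '1', '2' ]
```

Because the same function runs on both sides, the comparison is between already-normalized values, which is also why a sort on this field ignores case and accents.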

Also, the fact that keywords are converted before indexing means that aggregations return normalized values:

GET index/_search
{
  "size": 0,
  "aggs": {
    "foo_terms": {
      "terms": {
        "field": "foo"
      }
    }
  }
}

returns

{
  "took": 43,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "foo_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "bar",
          "doc_count": 2
        },
        {
          "key": "baz",
          "doc_count": 1
        }
      ]
    }
  }
}

Source: Normalizer

Answer 3

Just a few additions based on @PiwEL's answer.

If you are using the Node.js client for Elasticsearch and your index already exists, you cannot simply update the settings like this:

await esClient.indices.putSettings({
      index,
      body: {
        analysis: {
          normalizer: {
            my_normalizer: {
              type: "custom",
              char_filter: [],
              filter: ["lowercase", "asciifolding"],
            },
          },
        },
      },
    });

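This API call always throws an exception. So I ended up defining this setting when creating the index, which meant I had to delete and recreate the index. More details can be found here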
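A likely cause of the exception is that analysis settings are static: Elasticsearch only accepts changes to them while the index is closed. The following is a minimal sketch of that alternative, not something the answer tested. It assumes esClient is an already configured client that takes a body parameter, as in the snippet above, and addNormalizerToExistingIndex is a hypothetical helper name.

```javascript
// Sketch only: `esClient` is assumed to be a configured Elasticsearch
// Node.js client; the function name is hypothetical.
async function addNormalizerToExistingIndex(esClient, index) {
  // Analysis settings are static, so close the index before changing them.
  await esClient.indices.close({ index });
  try {
    await esClient.indices.putSettings({
      index,
      body: {
        analysis: {
          normalizer: {
            my_normalizer: {
              type: "custom",
              char_filter: [],
              filter: ["lowercase", "asciifolding"],
            },
          },
        },
      },
    });
  } finally {
    // Reopen even if the update failed, so the index is not left closed.
    await esClient.indices.open({ index });
  }
}
```

Even with the normalizer registered this way, the mapping of an existing field cannot be switched over to it, so a new sub-field or a reindex would still be needed. Recreating the index remains the simpler route.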

With the Node.js client, creating the index as follows worked for me:

await esClient.indices.create({
  index,
  body: {
    settings: {
      analysis: {
        normalizer: {
          my_normalizer: {
            type: "custom",
            char_filter: [],
            filter: ["lowercase", "asciifolding"],
          },
        },
      },
    },
    mappings: {
      properties: {
        id: {
          type: "keyword",
        },
        name: {
          type: "text",
          fields: {
            keyword: {
              type: "keyword",
              normalizer: "my_normalizer",
            },
          },
        },
        seedless: {
          type: "boolean",
        },
        origin: { type: "text" },
      },
    },
  },
});
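With the index created this way, a case-insensitive sort would target the normalized name.keyword sub-field rather than the analyzed name field. The following is a sketch of such a request: buildSortedSearch is a hypothetical helper, "my_index" is a placeholder index name, and the body parameter style is assumed to match the client version used above.

```javascript
// Hypothetical helper that builds the search parameters for a sorted query.
function buildSortedSearch(index, order = "asc") {
  return {
    index,
    body: {
      query: { match_all: {} },
      // Sort on the normalized keyword sub-field, not the analyzed text field.
      sort: [{ "name.keyword": { order } }],
    },
  };
}

// "my_index" is a placeholder; use the index name passed to indices.create.
const params = buildSortedSearch("my_index");
console.log(JSON.stringify(params.body.sort)); // [{"name.keyword":{"order":"asc"}}]

// Usage, assuming a configured client:
//   const result = await esClient.search(params);
```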
