Elasticsearch _id as MD5 hash or document fields

Question

There are examples online of customizing the _id field for Elasticsearch documents, but is there a way to generate a composite _id from multiple fields?

Sample data

{
  "first_name": "john",
  "last_name": "doe",
  "dob": "1987-12-21",
  "phone": "7894456123",
  "so": "on"...
}

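How can I configure an ingest pipeline to generate the _id from the concatenation of the first 4 fields, for a use case where those fields are treated as a composite primary key?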

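Caveats:

_id has a length limit, and the concatenation of the 4 fields could exceed it at any time.
Some kind of separator is needed, so that 2 documents with different field values cannot end up with the same concatenated value.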

I considered hashing algorithms such as MD5 and SHA256, which can produce a fixed-length _id from "|".join(first,last,dob,phone), but I could not implement this in an ingest pipeline.
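
To make the idea concrete, here is a minimal client-side sketch in Python of hashing the joined key fields into a fixed-length ID. The helper name composite_id and the Base64 encoding are assumptions made for illustration. It runs outside Elasticsearch, and its output is not expected to match what any built-in ingest processor would produce.

```python
import base64
import hashlib


def composite_id(first_name, last_name, dob, phone, sep="|"):
    """Hypothetical helper: hash the joined key fields into a fixed-length ID."""
    # The separator keeps ("ab", "c") and ("a", "bc") from joining to the same string.
    # This only holds if the separator cannot occur inside a field value.
    joined = sep.join([first_name, last_name, dob, phone])
    digest = hashlib.md5(joined.encode("utf-8")).digest()  # always 16 bytes
    return base64.b64encode(digest).decode("ascii")        # always 24 characters


doc_id = composite_id("john", "doe", "1987-12-21", "7894456123")
print(len(doc_id))  # 24, regardless of how long the input fields are
```

Because the digest length is fixed, the ID stays the same size no matter how long the four fields grow. Swapping hashlib.md5 for hashlib.sha256 would give a 32-byte digest at the cost of a longer ID.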

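Security is not a concern here, since we are only trying to define a primary key, and the index rolls over monthly.

So ideally we would find a storage-efficient _id value.

If there is another way to achieve this use case, please suggest it.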

Answer 1

Enter the fingerprint ingest processor (available since ES 7.12.0).

You can define an ingest pipeline with that processor and set the _id field as desired:

PUT _ingest/pipeline/id-fingerprint
{
  "processors": [
    {
      "fingerprint": {
        "fields": ["first_name", "last_name", "dob", "phone"],
        "target_field": "_id",
        "method": "MD5"
      }
    }
  ]
}

Then, when you index a document, you can simply reference that pipeline:

PUT test/_doc/1?pipeline=id-fingerprint
{
  "first_name": "john",
  "last_name": "doe",
  "dob": "1987-12-21",
  "phone": "7894456123",
  "so": "on"
}

Result =>

{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "Xu28Onz3lbYCG0DrTTVp6Q==",      <--- the generated ID
  "_source" : {
    "phone" : "7894456123",
    "dob" : "1987-12-21",
    "last_name" : "doe",
    "so" : "on",
    "first_name" : "john"
  }
}
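
As a quick sanity check on the result above, the following Python sketch decodes the generated ID. It assumes the value is the Base64 encoding of the raw MD5 digest, which its 24-character, ==-padded form suggests.

```python
import base64

generated_id = "Xu28Onz3lbYCG0DrTTVp6Q=="  # copied from the result above

raw = base64.b64decode(generated_id)
print(len(generated_id))  # 24 characters in the ID
print(len(raw))           # 16 bytes, the size of an MD5 digest
```

At 24 characters the ID is fixed-length and sits well below the 512-byte limit that Elasticsearch documents for _id, which addresses the length concern raised in the question.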
