Elasticsearch ingest pipeline: how to recursively modify values in a HashMap

Question

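Using an ingest pipeline, I want to iterate over a HashMap and remove underscores from all string values (wherever underscores exist), while leaving underscores in the keys intact. Some values are arrays, which must be iterated further to apply the same modification.

In the pipeline, I use a function to traverse and modify the values of the HashMap's Collection view.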

PUT /_ingest/pipeline/samples
{
    "description": "preprocessing of samples.json",
    "processors": [
        {
            "script": {
                "tag": "remove underscore from sample_tags values",
                "source": """
                    void findReplace(Collection collection) {
                    collection.forEach(element -> {
                        if (element instanceof String) {
                            element.replace('_',' ');
                        } else {
                            findReplace(element);
                        }
                        return true;
                        })
                    }

                    Collection samples = ctx.samples;
                    samples.forEach(sample -> { //sample.sample_tags is a HashMap
                        Collection sample_tags = sample.sample_tags.values();
                        findReplace(sample_tags);
                        return true;
                    })
                """
            }
        }
    ]
}

When I simulate the pipeline ingestion, I find that the string values are not modified. Where am I going wrong?

POST /_ingest/pipeline/samples/_simulate
{
    "docs": [
        {
            "_index": "samples",
            "_id": "xUSU_3UB5CXFr25x7DcC",
            "_source": {
                "samples": [
                    {
                        "sample_tags": {
                            "Entry_A": [
                                "A_hyphentated-sample",
                                "sample1"
                            ],
                            "Entry_B": "A_multiple_underscore_example",
                            "Entry_C": [
                                        "sample2",
                                        "another_example_with_underscores"
                            ],
                            "Entry_E": "last_example"
                        }
                    }
                ]
            }
        }
    ]
}

Result

{
  "docs" : [
    {
      "doc" : {
        "_index" : "samples",
        "_type" : "_doc",
        "_id" : "xUSU_3UB5CXFr25x7DcC",
        "_source" : {
          "samples" : [
            {
              "sample_tags" : {
                "Entry_E" : "last_example",
                "Entry_C" : [
                  "sample2",
                  "another_example_with_underscores"
                ],
                "Entry_B" : "A_multiple_underscore_example",
                "Entry_A" : [
                  "A_hyphentated-sample",
                  "sample1"
                ]
              }
            }
          ]
        },
        "_ingest" : {
          "timestamp" : "2020-12-01T17:29:52.3917165Z"
        }
      }
    }
  ]
}

Answer 1

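You're on the right track, but you're working on copies of the values and never setting the modified values back onto the document context ctx, which is what ultimately gets returned from the pipeline. In particular, String.replace returns a new string and leaves the original untouched, so its result has to be stored somewhere. This means you need to keep track of the current iteration indexes (for array lists, hash maps, and everything in between) so that you can target each field's position in the deeply nested context.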
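To make the point about discarded results concrete, here is a small sketch in plain Java rather than Painless (Painless uses Java's String type, so the same behaviour applies). The class and method names are made up for illustration and are not part of the pipeline.

```java
import java.util.ArrayList;
import java.util.List;

public class ImmutableStringDemo {
    // Mirrors the loop in the question: replace() builds a new String,
    // but nothing keeps it, so the list still holds the old values.
    static List<String> discardResult(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.forEach(element -> element.replace('_', ' '));
        return copy;
    }

    // Stores every replaced String back into its slot in the list.
    static List<String> storeResult(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.replaceAll(element -> element.replace('_', ' '));
        return copy;
    }

    public static void main(String[] args) {
        List<String> tags = List.of("last_example", "sample1");
        System.out.println(discardResult(tags)); // [last_example, sample1]
        System.out.println(storeResult(tags));   // [last example, sample1]
    }
}
```

The only difference between the two methods is whether the return value of replace is written back, which is the same gap as in the question's script.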

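Here is an example that handles strings and (string-only) array lists. You would need to extend it to handle hash maps (and other types), and then perhaps extract the whole process into a separate function. AFAIK you can't return multiple data types in Java, though, so it may be challenging...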

PUT /_ingest/pipeline/samples
{
  "description": "preprocessing of samples.json",
  "processors": [
    {
      "script": {
        "tag": "remove underscore from sample_tags values",
        "source": """
          ArrayList samples = ctx.samples;
        
          for (int i = 0; i < samples.size(); i++) {
              def sample = samples.get(i).sample_tags;
              
              for (def entry : sample.entrySet()) {
                  def key = entry.getKey();
                  def val = entry.getValue();
                  // default to the original value so unhandled types are written back unchanged
                  def replaced_val = val;
                  
                  if (val instanceof String) {
                    replaced_val = val.replace('_',' ');
                  } else if (val instanceof ArrayList) {
                    replaced_val = new ArrayList();
                    for (int j = 0; j < val.length; j++) {
                        replaced_val.add(val[j].replace('_',' ')); 
                    }
                  } 
                  // else if (val instanceof HashMap) {
                    // do your thing
                  // }
                  
                  // crucial part: write the new value back into sample_tags on the document
                  ctx.samples[i].sample_tags[key] = replaced_val;
              }
          }
        """
      }
    }
  ]
}
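On the return-type concern: one possible way around it is to have the helper return a plain Object, so strings, lists and maps can all pass through the same recursive function. The sketch below is ordinary Java with hypothetical names (RecursiveReplace, clean); in Painless the dynamic def type should be able to play the same role, but that is an assumption and not something tested here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RecursiveReplace {
    // Returns a cleaned copy of value. Map keys are copied as they are;
    // only String values have their underscores replaced.
    static Object clean(Object value) {
        if (value instanceof String) {
            return ((String) value).replace('_', ' ');
        }
        if (value instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object item : (List<?>) value) {
                out.add(clean(item));
            }
            return out;
        }
        if (value instanceof Map) {
            Map<Object, Object> out = new LinkedHashMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                out.put(e.getKey(), clean(e.getValue()));
            }
            return out;
        }
        return value; // numbers, booleans and null pass through unchanged
    }

    public static void main(String[] args) {
        Map<String, Object> tags = new LinkedHashMap<>();
        tags.put("Entry_A", List.of("A_hyphentated-sample", "sample1"));
        tags.put("Entry_B", "A_multiple_underscore_example");
        System.out.println(clean(tags));
        // {Entry_A=[A hyphentated-sample, sample1], Entry_B=A multiple underscore example}
    }
}
```

Because clean builds new collections, the caller still has to assign the result back to the document, just as the script above does with ctx.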

Answer 2

Here is a modified version of your script that works with the data you provided:

PUT /_ingest/pipeline/samples
{
  "description": "preprocessing of samples.json",
  "processors": [
    {
      "script": {
        "tag": "remove underscore from sample_tags values",
        "source": """
          String replaceString(String value) {
            return value.replace('_',' ');
          }
      
          void findReplace(Map map) {
            map.keySet().forEach(key -> {
              if (map[key] instanceof String) {
                  map[key] = replaceString(map[key]);
              } else {
                  map[key] = map[key].stream().map(this::replaceString).collect(Collectors.toList());
              }
            });
          }

          ctx.samples.forEach(sample -> {
              findReplace(sample.sample_tags);
              return true;
          });
          """
      }
    }
  ]
}

The result looks like this:

     {
      "samples" : [
        {
          "sample_tags" : {
            "Entry_E" : "last example",
            "Entry_C" : [
              "sample2",
              "another example with underscores"
            ],
            "Entry_B" : "A multiple underscore example",
            "Entry_A" : [
              "A hyphentated-sample",
              "sample1"
            ]
          }
        }
      ]
    }
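The script above covers one level of lists inside the map. If the values can nest deeper (lists of lists, or maps inside maps), the same in-place idea can be made recursive. The following is a sketch in plain Java using Map.replaceAll and List.replaceAll; the names InPlaceReplace and cleanInPlace are hypothetical, and it assumes the collections are mutable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InPlaceReplace {
    // Rewrites String values inside the given structure without copying it.
    // Lists and maps are modified in place and returned as the same object.
    @SuppressWarnings("unchecked")
    static Object cleanInPlace(Object value) {
        if (value instanceof String) {
            return ((String) value).replace('_', ' ');
        }
        if (value instanceof List) {
            ((List<Object>) value).replaceAll(InPlaceReplace::cleanInPlace);
        } else if (value instanceof Map) {
            // keys are ignored on purpose so their underscores survive
            ((Map<Object, Object>) value).replaceAll((key, v) -> cleanInPlace(v));
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, Object> tags = new HashMap<>();
        tags.put("Entry_C", new ArrayList<>(List.of("sample2", "another_example_with_underscores")));
        tags.put("Entry_E", "last_example");
        cleanInPlace(tags);
        System.out.println(tags.get("Entry_C")); // [sample2, another example with underscores]
        System.out.println(tags.get("Entry_E")); // last example
    }
}
```

replaceAll only swaps values for existing keys or positions, so it does not structurally modify the collection while it is being traversed.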
