Elasticsearch 摄取管道:如何递归修改 HashMap 中的值
Elasticsearch ingest pipeline: how to recursively modify values in a HashMap
使用摄取管道,我想遍历 HashMap 并从所有字符串值(存在下划线的地方)中删除下划线,而键中的下划线保持不变。有些值是数组,必须进一步迭代才能进行相同的修改。
在管道中,我使用一个函数遍历和修改HashMap的Collection视图的值。
PUT /_ingest/pipeline/samples
{
"description": "preprocessing of samples.json",
"processors": [
{
"script": {
"tag": "remove underscore from sample_tags values",
"source": """
void findReplace(Collection collection) {
collection.forEach(element -> {
if (element instanceof String) {
element.replace('_',' ');
} else {
findReplace(element);
}
return true;
})
}
Collection samples = ctx.samples;
samples.forEach(sample -> { //sample.sample_tags is a HashMap
Collection sample_tags = sample.sample_tags.values();
findReplace(sample_tags);
return true;
})
"""
}
}
]
}
当我模拟管道摄取时,我发现字符串值没有被修改。我哪里错了?
POST /_ingest/pipeline/samples/_simulate
{
"docs": [
{
"_index": "samples",
"_id": "xUSU_3UB5CXFr25x7DcC",
"_source": {
"samples": [
{
"sample_tags": {
"Entry_A": [
"A_hyphentated-sample",
"sample1"
],
"Entry_B": "A_multiple_underscore_example",
"Entry_C": [
"sample2",
"another_example_with_underscores"
],
"Entry_E": "last_example"
}
}
]
}
}
]
}
\Result
{
"docs" : [
{
"doc" : {
"_index" : "samples",
"_type" : "_doc",
"_id" : "xUSU_3UB5CXFr25x7DcC",
"_source" : {
"samples" : [
{
"sample_tags" : {
"Entry_E" : "last_example",
"Entry_C" : [
"sample2",
"another_example_with_underscores"
],
"Entry_B" : "A_multiple_underscore_example",
"Entry_A" : [
"A_hyphentated-sample",
"sample1"
]
}
}
]
},
"_ingest" : {
"timestamp" : "2020-12-01T17:29:52.3917165Z"
}
}
}
]
}
您走在正确的道路上,但您正在处理 值的副本,并且没有将修改后的值设置回文档上下文 ctx
,这最终return从管道中编辑。这意味着您需要跟踪当前的迭代索引——对于数组列表、哈希映射以及介于两者之间的所有内容——这样您就可以在深层嵌套的上下文中定位字段的位置。
这是一个处理字符串和(仅字符串)数组列表的示例。您需要扩展它以处理散列映射(和其他类型),然后可能将整个过程提取到一个单独的函数中。但是 AFAIK 你不能在 Java 中 return 多种数据类型,所以它可能具有挑战性......
PUT /_ingest/pipeline/samples
{
"description": "preprocessing of samples.json",
"processors": [
{
"script": {
"tag": "remove underscore from sample_tags values",
"source": """
ArrayList samples = ctx.samples;
for (int i = 0; i < samples.size(); i++) {
def sample = samples.get(i).sample_tags;
for (def entry : sample.entrySet()) {
def key = entry.getKey();
def val = entry.getValue();
def replaced_val;
if (val instanceof String) {
replaced_val = val.replace('_',' ');
} else if (val instanceof ArrayList) {
replaced_val = new ArrayList();
for (int j = 0; j < val.length; j++) {
replaced_val.add(val[j].replace('_',' '));
}
}
// else if (val instanceof HashMap) {
// do your thing
// }
// crucial part
ctx.samples[i][key] = replaced_val;
}
}
"""
}
}
]
}
这是您的脚本的修改版本,可以处理您提供的数据:
PUT /_ingest/pipeline/samples
{
"description": "preprocessing of samples.json",
"processors": [
{
"script": {
"tag": "remove underscore from sample_tags values",
"source": """
String replaceString(String value) {
return value.replace('_',' ');
}
void findReplace(Map map) {
map.keySet().forEach(key -> {
if (map[key] instanceof String) {
map[key] = replaceString(map[key]);
} else {
map[key] = map[key].stream().map(this::replaceString).collect(Collectors.toList());
}
});
}
ctx.samples.forEach(sample -> {
findReplace(sample.sample_tags);
return true;
});
"""
}
}
]
}
结果如下所示:
{
"samples" : [
{
"sample_tags" : {
"Entry_E" : "last example",
"Entry_C" : [
"sample2",
"another example with underscores"
],
"Entry_B" : "A multiple underscore example",
"Entry_A" : [
"A hyphentated-sample",
"sample1"
]
}
}
]
}
使用摄取管道,我想遍历 HashMap 并从所有字符串值(存在下划线的地方)中删除下划线,而键中的下划线保持不变。有些值是数组,必须进一步迭代才能进行相同的修改。
在管道中,我使用一个函数遍历和修改HashMap的Collection视图的值。
PUT /_ingest/pipeline/samples
{
"description": "preprocessing of samples.json",
"processors": [
{
"script": {
"tag": "remove underscore from sample_tags values",
"source": """
void findReplace(Collection collection) {
collection.forEach(element -> {
if (element instanceof String) {
element.replace('_',' ');
} else {
findReplace(element);
}
return true;
})
}
Collection samples = ctx.samples;
samples.forEach(sample -> { //sample.sample_tags is a HashMap
Collection sample_tags = sample.sample_tags.values();
findReplace(sample_tags);
return true;
})
"""
}
}
]
}
当我模拟管道摄取时,我发现字符串值没有被修改。我哪里错了?
POST /_ingest/pipeline/samples/_simulate
{
"docs": [
{
"_index": "samples",
"_id": "xUSU_3UB5CXFr25x7DcC",
"_source": {
"samples": [
{
"sample_tags": {
"Entry_A": [
"A_hyphentated-sample",
"sample1"
],
"Entry_B": "A_multiple_underscore_example",
"Entry_C": [
"sample2",
"another_example_with_underscores"
],
"Entry_E": "last_example"
}
}
]
}
}
]
}
\Result
{
"docs" : [
{
"doc" : {
"_index" : "samples",
"_type" : "_doc",
"_id" : "xUSU_3UB5CXFr25x7DcC",
"_source" : {
"samples" : [
{
"sample_tags" : {
"Entry_E" : "last_example",
"Entry_C" : [
"sample2",
"another_example_with_underscores"
],
"Entry_B" : "A_multiple_underscore_example",
"Entry_A" : [
"A_hyphentated-sample",
"sample1"
]
}
}
]
},
"_ingest" : {
"timestamp" : "2020-12-01T17:29:52.3917165Z"
}
}
}
]
}
您走在正确的道路上,但您正在处理 值的副本,并且没有将修改后的值设置回文档上下文 ctx
,这最终return从管道中编辑。这意味着您需要跟踪当前的迭代索引——对于数组列表、哈希映射以及介于两者之间的所有内容——这样您就可以在深层嵌套的上下文中定位字段的位置。
这是一个处理字符串和(仅字符串)数组列表的示例。您需要扩展它以处理散列映射(和其他类型),然后可能将整个过程提取到一个单独的函数中。但是 AFAIK 你不能在 Java 中 return 多种数据类型,所以它可能具有挑战性......
PUT /_ingest/pipeline/samples
{
"description": "preprocessing of samples.json",
"processors": [
{
"script": {
"tag": "remove underscore from sample_tags values",
"source": """
ArrayList samples = ctx.samples;
for (int i = 0; i < samples.size(); i++) {
def sample = samples.get(i).sample_tags;
for (def entry : sample.entrySet()) {
def key = entry.getKey();
def val = entry.getValue();
def replaced_val;
if (val instanceof String) {
replaced_val = val.replace('_',' ');
} else if (val instanceof ArrayList) {
replaced_val = new ArrayList();
for (int j = 0; j < val.length; j++) {
replaced_val.add(val[j].replace('_',' '));
}
}
// else if (val instanceof HashMap) {
// do your thing
// }
// crucial part
ctx.samples[i][key] = replaced_val;
}
}
"""
}
}
]
}
这是您的脚本的修改版本,可以处理您提供的数据:
PUT /_ingest/pipeline/samples
{
"description": "preprocessing of samples.json",
"processors": [
{
"script": {
"tag": "remove underscore from sample_tags values",
"source": """
String replaceString(String value) {
return value.replace('_',' ');
}
void findReplace(Map map) {
map.keySet().forEach(key -> {
if (map[key] instanceof String) {
map[key] = replaceString(map[key]);
} else {
map[key] = map[key].stream().map(this::replaceString).collect(Collectors.toList());
}
});
}
ctx.samples.forEach(sample -> {
findReplace(sample.sample_tags);
return true;
});
"""
}
}
]
}
结果如下所示:
{
"samples" : [
{
"sample_tags" : {
"Entry_E" : "last example",
"Entry_C" : [
"sample2",
"another example with underscores"
],
"Entry_B" : "A multiple underscore example",
"Entry_A" : [
"A hyphentated-sample",
"sample1"
]
}
}
]
}