Extract text from field arrays
One of the fields, named "resources", contains the following 2 inner documents.
{
"type": "AWS::S3::Object",
"ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"
},
{
"accountId": "934331768510612",
"type": "AWS::S3::Bucket",
"ARN": "arn:aws:s3:::sms_vild"
}
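I need to split the ARN field and get its last part, i.e. "reports_201706.schema", preferably using a scripted field.
What I have tried:
1) I checked the field list and found only 2 entries: resources.accountId and resources.type
2) I tried a date-time field, and it works correctly in the scripted field option (expression).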
doc['eventTime'].value
3) But the same does not work for other text fields, for example
doc['eventType'].value
I get this error:
"caused_by":{"type":"script_exception","reason":"link error","script_stack":["doc['eventType'].value","^---- HERE"],"script":"doc['eventType'].value","lang":"expression","caused_by":{"type":"illegal_argument_exception","reason":"Fielddata is disabled on text fields by default. Set fielddata=true on [eventType] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."}}},"status":500}
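This means I need to change the mapping. Is there any other way to extract text from a nested array inside an object?
Update:
Please visit the sample Kibana here...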
https://search-accountact-phhofxr23bjev4uscghwda4y7m.us-east-1.es.amazonaws.com/_plugin/kibana/
Search for "ebs_attach.png" and then check the resources field. You will see 2 nested entries like this...
{
"type": "AWS::S3::Object",
"ARN": "arn:aws:s3:::datameetgeo/ebs_attach.png"
},
{
"accountId": "513469704633",
"type": "AWS::S3::Bucket",
"ARN": "arn:aws:s3:::datameetgeo"
}
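I need to split the ARN field and extract the last part, "ebs_attach.png".
If I can somehow display it as a scripted field, then I can see the bucket name and the file name side by side on the Discover tab.
Update 2
In other words, I am trying to extract the text shown in this image as a new field on the Discover tab.
Note: treat "resources" as a kind of array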
NSArray *array_ARN_Values = [resources valueForKey:@"ARN"];
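Hope this helps!
While you can use scripting for this, I strongly recommend extracting this kind of information at index time. I have provided two examples here. They are far from failsafe (you need to test with different paths, or with this field missing entirely), but they should provide a base to start from.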
PUT foo/bar/1
{
"resources": [
{
"type": "AWS::S3::Object",
"ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"
},
{
"accountId": "934331768510612",
"type": "AWS::S3::Bucket",
"ARN": "arn:aws:s3:::sms_vild"
}
]
}
# this is slow!!!
GET foo/_search
{
"script_fields": {
"document": {
"script": {
"inline": "return params._source.resources.stream().filter(r -> 'AWS::S3::Object'.equals(r.type)).map(r -> r.ARN.substring(r.ARN.lastIndexOf('/') + 1)).findFirst().orElse('NONE')"
}
}
}
}
# Do this on index time, by adding a pipeline
PUT _ingest/pipeline/my-pipeline-id
{
"description" : "describe pipeline",
"processors" : [
{
"script" : {
"inline": "ctx.filename = ctx.resources.stream().filter(r -> 'AWS::S3::Object'.equals(r.type)).map(r -> r.ARN.substring(r.ARN.lastIndexOf('/') + 1)).findFirst().orElse('NONE')"
}
}
]
}
# Store the document, specify the pipeline
PUT foo/bar/1?pipeline=my-pipeline-id
{
"resources": [
{
"type": "AWS::S3::Object",
"ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"
},
{
"accountId": "934331768510612",
"type": "AWS::S3::Bucket",
"ARN": "arn:aws:s3:::sms_vild"
}
]
}
# Let's check the filename field of the indexed document by getting it
GET foo/bar/1
# We can even search for this file now
GET foo/_search
{
"query": {
"match": {
"filename": "reports_201706.schema"
}
}
}
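Since the scripts above are described as far from failsafe, it may help to prototype the extraction logic outside Elasticsearch and try the edge cases first. The following is only a sketch in Python, not Painless: the function name `extract_filename` and the `default` parameter are made up for illustration, and it assumes documents shaped like the examples in this thread. Unlike the stream-based scripts, it skips an S3 object entry that has no usable ARN instead of failing on it.

```python
# Hypothetical helper that mirrors the idea of the Painless scripts:
# take the first "AWS::S3::Object" resource and keep what follows the last '/'.

S3_OBJECT_TYPE = "AWS::S3::Object"


def extract_filename(doc, default="NONE"):
    """Return the last path segment of the first S3 object ARN in doc["resources"]."""
    resources = doc.get("resources")
    if not isinstance(resources, list):
        # Field missing or not an array: fall back instead of raising.
        return default
    for resource in resources:
        if not isinstance(resource, dict):
            continue
        arn = resource.get("ARN")
        if resource.get("type") == S3_OBJECT_TYPE and isinstance(arn, str):
            # Equivalent to substring(lastIndexOf('/') + 1): if there is no '/',
            # the whole ARN is returned.
            return arn.rsplit("/", 1)[-1]
    return default


sample = {
    "resources": [
        {
            "type": "AWS::S3::Object",
            "ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema",
        },
        {
            "accountId": "934331768510612",
            "type": "AWS::S3::Bucket",
            "ARN": "arn:aws:s3:::sms_vild",
        },
    ]
}

print(extract_filename(sample))  # reports_201706.schema
print(extract_filename({}))      # NONE (no "resources" field at all)
```

Once the edge cases behave as expected here, the same checks (missing field, wrong type, missing ARN) would still need to be ported to the pipeline script by hand.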