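How to extract data from specific fields in a NESTED JSON using AWS Athena - Presto?
I have JSON in the following format in an S3 bucket, and I am trying to extract only "id", "label", and "value" from the "fields" key using Athena. I tried ARRAY-MAP without success. Also, for the "value" field, I would like to capture the content as plain text, ignoring any lists/dictionaries inside it.
I would also prefer not to create any Hive schema for these JSON documents; if possible, I am looking for a Presto SQL solution.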
{
"reports":{
"client":{
"pdf":"https://reports.s3-accelerate.amazonaws.com/looks/123/reports/client.pdf",
"html":"https://api.com/looks/123/reports/client.html"
},
"public":{
"pdf":"https://s3.amazonaws.com/reports.com/looks/123/reports/public.pdf",
"html":"https://api.look.com/looks/123/reports/public.html"
}
},
"actors":{
"looker":{
"firstName":"Rosa",
"lastName":"Mart"
},
"client":{
"email":"XXX.XXX@XXXXXX.com",
"firstName":"XXX",
"lastName":"XXX"
}
},
"_id":"123",
"fields":[
{
"id":"fence_condition_missing_sections",
"context":[
"Fence Condition"
],
"label":"Missing Sections",
"type":"choice",
"value":"None"
},
{
"id":"photos_landscaped_area",
"context":[
"Landscaping Photos"
],
"label":"Landscaped Area",
"type":"photo-with-description",
"value":[
{
"description":"Front",
"photo":"https://reports-wegolook-com.s3-accelerate.amazonaws.com/looks/123/looker/1.jpg"
},
{
"description":"Front entrance ",
"photo":"https://reports-wegolook-com.s3-accelerate.amazonaws.com/looks/123/looker/2.jpg"
}
]
}
],
"jobNumber":"xxx",
"createdAt":"2018-10-11T22:39:37.223Z",
"completedAt":"2018-01-27T20:13:49.937Z",
"inspectedAt":"2018-01-21T23:33:48.718Z",
"type":"ZZZ-commercial",
"name":"Commercial"
}
Expected output:
---------------------------------------------------------------------------------------
| ID                               | LABEL            | VALUE                         |
---------------------------------------------------------------------------------------
| photos_landscaped_area           | Landscaped Area  | [{"description":"Front",...}] |
---------------------------------------------------------------------------------------
| fence_condition_missing_sections | Missing Sections | None                          |
---------------------------------------------------------------------------------------
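I am going to assume that your data is stored as one document per line, and that you formatted the example only for readability. If that is not the case, see the question "Multi-line JSON file querying in hive".
When the schema of the JSON documents is not completely regular (here, "value" is sometimes a string and sometimes an array), you can declare the column as a string column and use the JSON_* functions to extract values from it.
First, create a table over the raw data: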
-- Athena requires the EXTERNAL keyword for tables over data in S3
CREATE EXTERNAL TABLE data (
  fields array<struct<id:string,label:string,value:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://…'
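(If you are not interested in the other fields of the JSON documents, you can simply leave them out when creating the table.)
Then create a view that flattens the data: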
CREATE VIEW flat_data AS
SELECT
field.id,
field.label,
field.value
FROM data
CROSS JOIN UNNEST(fields) AS f(field)
Selecting from this view should give you the result you are after.
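For instance, a query along the following lines should return the three columns from the expected output. It is only a sketch: it assumes the flat_data view above was created as shown, and the ORDER BY is added purely to make the row order predictable.

```sql
-- Read the flattened rows from the view defined above.
-- "value" comes back as text: a plain string for simple fields,
-- and the serialized JSON for fields whose value is a list.
SELECT
  id,
  label,
  value
FROM flat_data
ORDER BY id
```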
I suspect you are also looking for how to extract properties from the value structure, which is what I alluded to above:
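SELECT
  label,
  -- value holds a JSON array here, so '$.photo' alone would not match;
  -- parse the array and take "photo" from each element
  TRANSFORM(
    CAST(JSON_PARSE(value) AS ARRAY(MAP(VARCHAR, VARCHAR))),
    v -> v['photo']
  ) AS photo_urls
FROM flat_data
WHERE id = 'photos_landscaped_area'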
See the Presto documentation for all available JSON functions.
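If you would rather avoid declaring the nested schema at all, the same flattening can be sketched with JSON functions alone. The query below is a minimal illustration, not a tested solution for your data: the inline document is a shortened, made-up stand-in for one line of your file, and the names raw, doc, and field are hypothetical. In practice doc would come from a table whose single string column holds each raw line.

```sql
-- Hypothetical inline sample standing in for one raw JSON line.
WITH raw AS (
  SELECT '{"fields":[{"id":"a","label":"A","value":"None"},{"id":"b","label":"B","value":[{"description":"Front"}]}]}' AS doc
)
SELECT
  JSON_EXTRACT_SCALAR(field, '$.id')          AS id,
  JSON_EXTRACT_SCALAR(field, '$.label')       AS label,
  -- JSON_FORMAT serializes whatever "value" holds (string or list) as text
  JSON_FORMAT(JSON_EXTRACT(field, '$.value')) AS value
FROM raw
-- Turn the "fields" array into one row per element
CROSS JOIN UNNEST(CAST(JSON_EXTRACT(doc, '$.fields') AS ARRAY(JSON))) AS t(field)
```

Note that JSON_FORMAT keeps the surrounding double quotes on plain string values, so you may want to strip them if you need bare text.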