AWS Glue Custom Classifiers JSON Path
I have a set of JSON data files that look like this:
[
{"client":"toys",
"filename":"toy1.csv",
"file_row_number":1,
"secondary_db_index":"4050",
"processed_timestamp":1535004075,
"processed_datetime":"2018-08-23T06:01:15+0000",
"entity_id":"4050",
"entity_name":"4050",
"is_emailable":false,
"is_txtable":false,
"is_loadable":false}
]
I have created a Glue crawler with a custom classifier that uses the following JSON path:
$[*]
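Glue returns the correct schema, with the columns correctly identified.
However, when I query the data in Athena, all of the data ends up in the first column and the remaining columns are empty.
How can I get the data to spread across the columns?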
image of Athena query
Thanks!
This is a Hive-related issue. I suggest two approaches. First, you can create a new table in Athena that uses a struct data type, as follows:
CREATE EXTERNAL TABLE `example`(
`row` struct<client:string,filename:string,file_row_number:int,secondary_db_index:string,processed_timestamp:int,processed_datetime:string,entity_id:string,entity_name:string,is_emailable:boolean,is_txtable:boolean,is_loadable:boolean> COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://example'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='example',
'averageRecordSize'='271',
'classification'='json',
'compressionType'='none',
'jsonPath'='$[*]',
'objectCount'='1',
'recordCount'='1',
'sizeKey'='271',
'transient_lastDdlTime'='1535533583',
'typeOfData'='file')
You can then run a query like this:
SELECT row.client, row.filename, row.file_row_number FROM "example"
Second, you can restructure the JSON file as follows and then run the crawler again. In this example I used the single-JSON-record-per-line format.
{"client":"toys","filename":"toy1.csv","file_row_number":1,"secondary_db_index":"4050","processed_timestamp":1535004075,"processed_datetime":"2018-08-23T06:01:15+0000","entity_id":"4050","entity_name":"4050","is_emailable":false,"is_txtable":false,"is_loadable":false}
{"client":"toys2","filename":"toy2.csv","file_row_number":1,"secondary_db_index":"4050","processed_timestamp":1535004075,"processed_datetime":"2018-08-23T06:01:15+0000","entity_id":"4050","entity_name":"4050","is_emailable":false,"is_txtable":false,"is_loadable":false}
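If you have many existing files in the array layout, the restructuring can be scripted. The following is only a minimal sketch, not part of the original answer: the function name json_array_to_lines is hypothetical, and it assumes each file holds a single top-level JSON array of objects that fits in memory. It uses only Python's standard json module; reading from and writing back to S3 is left out.

```python
import json


def json_array_to_lines(text: str) -> str:
    """Convert a JSON array of objects into one compact JSON object per line.

    Hypothetical helper for illustration; assumes the whole array fits in memory.
    """
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Compact separators keep each record on a single line without extra spaces.
    # No commas are written between records, only newlines.
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


if __name__ == "__main__":
    sample = """[
    {"client":"toys",
     "filename":"toy1.csv",
     "file_row_number":1,
     "is_loadable":false}
    ]"""
    print(json_array_to_lines(sample), end="")
```

After converting the files and uploading them to the table location, you would run the crawler again as described above.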