Partitioning Athena Tables from a Glue CloudFormation Template
Using AWS::Glue::Table, you can set up an Athena table based on the folder structure of your files in S3, and Athena supports partitioning data. I want to partition my Athena table from my Glue template.
From the AWS Glue Table TableInput it looks like I can use PartitionKeys
to partition my data, but when I try the following template, Athena fails and cannot fetch any data.
Resources:
...
MyGlueTable:
Type: AWS::Glue::Table
Properties:
DatabaseName: !Ref MyGlueDatabase
CatalogId: !Ref AWS::AccountId
TableInput:
Name: my-glue-table
Parameters: { "classification" : "json" }
PartitionKeys:
- {Name: dt, Type: string}
StorageDescriptor:
Location: "s3://elasticmapreduce/samples/hive-ads/tables/impressions/"
InputFormat: "org.apache.hadoop.mapred.TextInputFormat"
OutputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
SerdeInfo:
Parameters: { "separatorChar" : "," }
SerializationLibrary: "org.apache.hive.hcatalog.data.JsonSerDe"
StoredAsSubDirectories: false
Columns:
- {Name: requestBeginTime, Type: string}
- {Name: adId, Type: string}
- {Name: impressionId, Type: string}
- {Name: referrer, Type: string}
- {Name: userAgent, Type: string}
- {Name: userCookie, Type: string}
- {Name: ip, Type: string}
- {Name: number, Type: string}
- {Name: processId, Type: string}
- {Name: browserCookie, Type: string}
- {Name: requestEndTime, Type: string}
- {Name: timers, Type: "struct<modellookup:string,requesttime:string>"}
- {Name: threadId, Type: string}
- {Name: hostname, Type: string}
- {Name: sessionId, Type: string}
How do I partition my data in AWS Glue?
Got it! It was quite painful, because I had to run a Glue Crawler to create the table partitions correctly, and then extract the correct template parameters with the CLI. Here is the template:
AWSTemplateFormatVersion: 2010-09-09
Description: A partitioned Glue Table
Resources:
MyGlueDatabase:
Type: AWS::Glue::Database
Properties:
DatabaseInput:
Name: my_glue_database
Description: "Glue beats tape"
CatalogId: !Ref AWS::AccountId
MyGlueTable:
Type: AWS::Glue::Table
Properties:
DatabaseName: !Ref MyGlueDatabase
CatalogId: !Ref AWS::AccountId
TableInput:
Name: my_glue_table
TableType: EXTERNAL_TABLE
Parameters:
CrawlerSchemaDeserializerVersion: "1.0"
CrawlerSchemaSerializerVersion: "1.0"
classification: json
compressionType: none
typeOfData: file
PartitionKeys:
- {Name: dt, Type: string}
StorageDescriptor:
BucketColumns: []
Columns:
- {Name: number, Type: string}
- {Name: referrer, Type: string}
- {Name: processid, Type: string}
- {Name: adid, Type: string}
- {Name: browsercookie, Type: string}
- {Name: usercookie, Type: string}
- {Name: requestendtime, Type: string}
- {Name: impressionid, Type: string}
- {Name: useragent, Type: string}
- {Name: timers, Type: 'struct<modelLookup:string,requestTime:string>'}
- {Name: threadid, Type: string}
- {Name: ip, Type: string}
- {Name: modelid, Type: string}
- {Name: hostname, Type: string}
- {Name: sessionid, Type: string}
- {Name: requestbegintime, Type: string}
Compressed: false
InputFormat: org.apache.hadoop.mapred.TextInputFormat
Location: s3://elasticmapreduce/samples/hive-ads/tables/impressions/
NumberOfBuckets: -1
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Parameters: {CrawlerSchemaDeserializerVersion: '1.0', CrawlerSchemaSerializerVersion: '1.0',
UPDATED_BY_CRAWLER: test, averageRecordSize: '644', classification: json,
compressionType: none, objectCount: '241', recordCount: '1000109', sizeKey: '648533598',
typeOfData: file}
SerdeInfo:
Parameters: {paths: 'adId,browserCookie,hostname,impressionId,ip,modelId,number,processId,referrer,requestBeginTime,requestEndTime,sessionId,threadId,timers,userAgent,userCookie'}
SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
SortColumns: []
StoredAsSubDirectories: false
CloudFormation will then deploy the table, after which you need to run:
MSCK REPAIR TABLE my_glue_table;
This adds all of the partitions, which you will see in the output:
Repair: Added partition to metastore my_glue_table:dt=2009-04-12-13-00
Repair: Added partition to metastore my_glue_table:dt=2009-04-12-13-05
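If you only need a handful of partitions, or want to avoid scanning the whole S3 prefix, you can also register partitions one at a time instead of running MSCK REPAIR TABLE; a sketch (the partition value and location below are illustrative):

```sql
ALTER TABLE my_glue_table ADD IF NOT EXISTS
  PARTITION (dt = '2009-04-12-13-00')
  LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/dt=2009-04-12-13-00/';
```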
You can then run SQL on top of these partitions:
SELECT * FROM "my_glue_database"."my_glue_table" WHERE dt = '2009-04-14-13-00' LIMIT 10;
1 7663 cartoonnetwork.com 1178 SxRBJCmJBCLcfTS545t6qD1M8L64SC nsdfvfvger 3VCLfFfF75BDgHgDoowHegOpkCivMJ 1239714024000 RTM6Vtrc1O3KX2FlUghUSiAQHiix8F Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB6; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0) {modellookup=0.3538, requesttime=0.7532} 15 37.215.88.35 bxxiuxduad ec2-50-32-48-14.amazon.com BIBIlA7dgXc2eWekUJ6hSXa7p6dQEx 1239714024000 2009-04-14-13-00
2 17646 coursera.org 1255 Fskm4W6JKX6vf7UMaW55KObTJCtm1E xftjotkexc jH6DRWtkeH3tVg6c4mcLW36UW3LvqX 1239714027000 uQqO1fNoeM8KdesiVg86o4iK7FkqLt Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322) {modellookup=0.2986, requesttime=0.9616} 21 37.218.101.204 bxxiuxduad ec2-50-32-48-14.amazon.com OjgTQWOqHJopoWf9LpJ4We1UE7uJao 1239714026000 2009-04-14-13-00
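Partitions can also be registered programmatically through the Glue API rather than via Athena DDL. Below is a minimal sketch of building the PartitionInput payload for Glue's CreatePartition call; the helper name is my own, and the boto3 call is shown as a comment because it requires live AWS credentials:

```python
def partition_input(table_location: str, dt: str) -> dict:
    """Build a PartitionInput payload for the Glue CreatePartition API.

    The partition's S3 location follows the Hive-style layout used by the
    sample data set: <table location>/dt=<value>/.
    """
    return {
        "Values": [dt],
        "StorageDescriptor": {
            "Location": f"{table_location.rstrip('/')}/dt={dt}/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
    }


# The payload would then be passed to boto3, for example:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_partition(
#       DatabaseName="my_glue_database",
#       TableName="my_glue_table",
#       PartitionInput=partition_input(
#           "s3://elasticmapreduce/samples/hive-ads/tables/impressions",
#           "2009-04-12-13-00",
#       ),
#   )
```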