Add partition projection to AWS Athena table using CloudFormation
I have an Athena table defined in a template, specified in CloudFormation like this:
CloudFormation creation
EventsTable:
  Type: AWS::Glue::Table
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: !Ref DatabaseName
    TableInput:
      Description: "My Table"
      Name: !Ref TableName
      TableType: EXTERNAL_TABLE
      StorageDescriptor:
        Compressed: True
        Columns:
          - Name: account_id
            Type: string
            Comment: "Account Id of the account making the request"
          ...
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        SerdeInfo:
          SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
        OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
        Location: !Sub "s3://${EventsBucketName}/events/"
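This works well and deploys. I also found that I can set up partition projection following this doc and this doc, and I got it working by creating the table directly, roughly:
SQL creation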
CREATE EXTERNAL TABLE `performance_data.events`
(
`account_id` string,
...
)
PARTITIONED BY (
`day` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://my-bucket/events/'
TBLPROPERTIES (
'has_encrypted_data' = 'false',
'projection.enabled' = 'true',
'projection.day.type' = 'date',
'projection.day.format' = 'yyyy/MM/dd',
'projection.day.range' = '2020/01/01,NOW',
'projection.day.interval' = '1',
'projection.day.interval.unit' = 'DAYS',
'storage.location.template' = 's3://my-bucket/events/${day}/'
)
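However, I cannot find documentation on how to translate this into the CloudFormation structure. So my question is: how do I implement the partition projection shown in the SQL code in CloudFormation?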
According to the CloudFormation reference for Glue Table TableInput, you can specify PartitionKeys and Parameters. These are the equivalents of PARTITIONED BY and TBLPROPERTIES in the query.
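The correspondence above can be sketched in a few lines of Python. This is only an illustration of how the two SQL clauses line up with the TableInput properties; the helper name to_table_input_fragment is hypothetical and not part of any AWS library.

```python
import json


def to_table_input_fragment(partition_columns, tbl_properties):
    """Build the PartitionKeys and Parameters parts of a Glue TableInput.

    partition_columns: list of (name, type) pairs from PARTITIONED BY.
    tbl_properties: dict of key/value pairs from TBLPROPERTIES.
    """
    return {
        # PARTITIONED BY (`day` string) -> PartitionKeys
        "PartitionKeys": [
            {"Name": name, "Type": col_type} for name, col_type in partition_columns
        ],
        # TBLPROPERTIES (...) -> Parameters; values are kept as strings here
        "Parameters": {key: str(value) for key, value in tbl_properties.items()},
    }


fragment = to_table_input_fragment(
    partition_columns=[("day", "string")],
    tbl_properties={
        "projection.enabled": "true",
        "projection.day.type": "date",
        "projection.day.format": "yyyy/MM/dd",
        "projection.day.range": "2020/01/01,NOW",
        "projection.day.interval": "1",
        "projection.day.interval.unit": "DAYS",
        "storage.location.template": "s3://my-bucket/events/${day}/",
    },
)
print(json.dumps(fragment, indent=2))
```

The printed structure has the same shape as the PartitionKeys and Parameters blocks that go under TableInput in the template.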
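Edit
For an example, you can refer to this article. The example below shows how to define PartitionKeys and how to define the JSON for Parameters. In your case, you only need to add the projection keys (e.g. projection.enabled) and their values (true).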
# Create an Amazon Glue table
CFNTableFlights:
  # Creating the table waits for the database to be created
  DependsOn: CFNDatabaseFlights
  Type: AWS::Glue::Table
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: !Ref CFNDatabaseName
    TableInput:
      Name: !Ref CFNTableName1
      Description: Define the first few columns of the flights table
      TableType: EXTERNAL_TABLE
      Parameters: {
        "classification": "csv"
      }
      # ViewExpandedText: String
      PartitionKeys:
        # Data is partitioned by month
        - Name: mon
          Type: bigint
      StorageDescriptor:
        OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        Columns:
          - Name: year
            Type: bigint
          - Name: quarter
            Type: bigint
          - Name: month
            Type: bigint
          - Name: day_of_month
            Type: bigint
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        Location: s3://crawler-public-us-east-1/flight/2016/csv/
        SerdeInfo:
          Parameters:
            field.delim: ","
          SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
I now have a working solution. The missing piece really was a single missing parameter; here is the solution:
MyTableResource:
  Type: AWS::Glue::Table
  Properties:
    CatalogId: MyAccountId
    DatabaseName: MyDatabase
    TableInput:
      Description: "My Table"
      Name: mytable
      TableType: EXTERNAL_TABLE
      PartitionKeys:
        - Name: day
          Type: string
          Comment: Day partition
      Parameters:
        "projection.enabled": "true"
        "projection.day.type": "date"
        "projection.day.format": "yyyy/MM/dd"
        "projection.day.range": "2020/01/01,NOW"
        "projection.day.interval": "1"
        "projection.day.interval.unit": "DAYS"
        "storage.location.template": "s3://my-bucket/events/${day}/"
      StorageDescriptor:
        Compressed: True
        Columns:
          ...
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        SerdeInfo:
          Parameters:
            serialization.format: '1'
          SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
        OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
        Location: "s3://my-bucket/events/"
The key addition was:
serialization.format: '1'
This now works fully, and the table can be queried using the partition:
select * from mytable where day > '2022/05/03'
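To see why the day filter works without any registered partitions, it may help to sketch what date projection conceptually does: enumerate the values in projection.day.range and substitute each one into storage.location.template. The following is a rough model under that assumption, not Athena's actual implementation; the function projected_locations and the fixed end date standing in for NOW are made up for the example.

```python
from datetime import date, timedelta


def projected_locations(start, end, template, after=None):
    """Yield (day, location) pairs for each day in [start, end].

    Models projection.day.format = yyyy/MM/dd with an interval of 1 DAYS.
    `after` mimics a filter such as: where day > '2022/05/03'.
    """
    current = start
    while current <= end:
        day = current.strftime("%Y/%m/%d")  # same layout as yyyy/MM/dd
        # The partition column is a string, so the comparison is lexicographic;
        # zero-padded yyyy/MM/dd values sort in chronological order.
        if after is None or day > after:
            yield day, template.replace("${day}", day)
        current += timedelta(days=1)


locations = list(
    projected_locations(
        start=date(2020, 1, 1),
        end=date(2022, 5, 6),  # hypothetical stand-in for NOW
        template="s3://my-bucket/events/${day}/",
        after="2022/05/03",
    )
)
for day, location in locations:
    print(day, location)
```

With the assumed end date, only the prefixes for 2022/05/04 through 2022/05/06 remain, which is the set of locations a query with that filter would need to read in this model.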