AWS update Athena meta: Glue Crawler vs MSCK Repair Table

When new partitions are added to an Athena table, we can use either a Glue Crawler or MSCK REPAIR TABLE to update the metadata. What does each cost? Which one is preferred?

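The MSCK REPAIR TABLE command requires that your S3 keys include the partition scheme, as documented here. If your S3 keys do not include the partition scheme, MSCK REPAIR TABLE will return the missing partitions, but you will still need to add them yourself. Another difference is that MSCK REPAIR TABLE can time out after 30 minutes (the default Athena query time limit), whereas a Glue crawler will not.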
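To make the partition-scheme requirement concrete: MSCK REPAIR TABLE only picks up prefixes laid out in Hive style, that is, with `column=value` path segments. Here is a minimal sketch of building and checking such a prefix. The helper names and the `year`/`month` columns are made up for illustration.

```python
def hive_partition_prefix(base_prefix, partitions):
    """Build a Hive-style S3 key prefix, e.g. 'logs/year=2024/month=01/'.

    `partitions` is an ordered list of (column, value) pairs.
    """
    base = base_prefix.strip("/")
    segments = [f"{column}={value}" for column, value in partitions]
    parts = ([base] if base else []) + segments
    return "/".join(parts) + "/"


def is_hive_style(key_prefix, partition_columns):
    """Return True if every partition column shows up as a 'column=' segment."""
    segments = key_prefix.strip("/").split("/")
    return all(
        any(segment.startswith(f"{column}=") for segment in segments)
        for column in partition_columns
    )
```

A prefix like `logs/2024/01/` fails this check, which is the case where the partitions have to be added explicitly.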

Here is the pricing information:

Glue Crawler:

There is an hourly rate for AWS Glue crawler runtime to discover data and populate the AWS Glue Data Catalog. You are charged an hourly rate based on the number of Data Processing Units (or DPUs) used to run your crawler. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. You are billed in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum duration for each crawl. Use of AWS Glue crawlers is optional, and you can populate the AWS Glue Data Catalog directly through the API.

Pricing

For all AWS Regions where AWS Glue is available: $0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run
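As a rough worked example of that billing rule, here is a small sketch that applies the $0.44 per DPU-Hour rate and the 10-minute minimum. The DPU count and runtime passed in are assumptions you would need to read off your own crawler runs; the function name is made up.

```python
import math

GLUE_CRAWLER_RATE_PER_DPU_HOUR = 0.44  # USD, the rate quoted in the Glue pricing text
MINIMUM_BILLED_SECONDS = 10 * 60       # 10-minute minimum per crawler run


def crawler_run_cost(runtime_seconds, dpus):
    """Estimate the cost in USD of one crawler run.

    Runtime is rounded up to a whole second, then raised to the
    10-minute minimum if the crawl finished sooner.
    """
    billed_seconds = max(math.ceil(runtime_seconds), MINIMUM_BILLED_SECONDS)
    return dpus * (billed_seconds / 3600) * GLUE_CRAWLER_RATE_PER_DPU_HOUR
```

Under these assumptions, a 2-minute crawl on 2 DPUs is billed as 10 minutes, so `crawler_run_cost(120, 2)` comes to roughly $0.15. Running that once an hour adds up to a few dollars a day, whereas the Athena DDL route has no per-statement charge.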

Athena:

There are no charges for Data Definition Language (DDL) statements like CREATE/ALTER/DROP TABLE, statements for managing partitions, or failed queries.

However, with either approach you will still incur S3 charges. See: AWS Athena: does `msck repair table` incur costs?

My opinion is that, if you can, it is best to manage partitions yourself after adding new data.

ALTER TABLE database.table ADD
PARTITION (partition_name='PartitionValue') LOCATION 's3://bucket/path/partition'
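If you issue that statement from code, a minimal sketch with boto3 could look like the following. The function names are made up, `athena_client` is assumed to be `boto3.client("athena")`, and `output_location` is an S3 URI for query results that you have to supply. I added `IF NOT EXISTS` so re-running is harmless; that is my addition, not part of the statement above.

```python
def add_partition_sql(database, table, partition, location):
    """Render an ALTER TABLE ... ADD PARTITION statement.

    `partition` maps partition column names to values. Values are quoted
    naively here, so only pass trusted input.
    """
    spec = ", ".join(f"{column}='{value}'" for column, value in partition.items())
    return (
        f"ALTER TABLE {database}.{table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


def add_partition(athena_client, database, table, partition, location, output_location):
    """Submit the DDL to Athena and return the query execution id."""
    response = athena_client.start_query_execution(
        QueryString=add_partition_sql(database, table, partition, location),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

The call returns as soon as the query is queued, so check the execution status afterwards if you need to know whether the partition was actually added.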

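If you are forced to use Glue or Athena, I would evaluate which approach fits your process better. The MSCK REPAIR TABLE command may be easier to manage, but you can run into trouble if the partitions hold a lot of data or the data is partitioned incorrectly. You also need some way to automate running the command. Glue Crawlers can be configured with triggers.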
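For the automation part, one option is a small script that submits MSCK REPAIR TABLE and polls until Athena reports a terminal state. This is only a sketch: the function name, the polling interval, and the `output_location` parameter are assumptions, and `athena_client` is assumed to be `boto3.client("athena")`.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_msck_repair(athena_client, database, table, output_location,
                    poll_seconds=5.0, sleep=time.sleep):
    """Run MSCK REPAIR TABLE and block until it finishes.

    Returns the final Athena query state as a string.
    """
    query_id = athena_client.start_query_execution(
        QueryString=f"MSCK REPAIR TABLE {table}",
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)  # still QUEUED or RUNNING, wait and check again
```

Something like this could run from a scheduler or right after the job that writes new data. A `FAILED` or `CANCELLED` result is the signal to fall back to adding the partitions explicitly.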

I agree with adding partitions manually. You can do this with an Athena query (ALTER TABLE ... ADD PARTITION () ...) as in @KiteCoder's answer, or you can do it directly through the Glue API.

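Calling the Glue API is more verbose, but also more 'structured'. Calling Athena obviously means a SQL query, and I know how many people despise writing code that dynamically generates SQL.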

The specific operation is CreatePartition. It does require an object called StorageDescriptor, which defines all the columns and data types in that table, but for an existing table you can retrieve that structure from the GetTable operation.
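A minimal sketch of that GetTable then CreatePartition flow with boto3 might look like this. The function name is made up and `glue_client` is assumed to be `boto3.client("glue")`. `values` must list the partition values in the same order as the table's partition keys.

```python
import copy


def create_partition_like_table(glue_client, database, table, values, location):
    """Register one partition, reusing the table's StorageDescriptor.

    Only the Location is changed, so columns, formats and SerDe settings
    stay identical to the table definition.
    """
    table_def = glue_client.get_table(DatabaseName=database, Name=table)["Table"]

    # Copy so the dict returned by get_table is left untouched.
    descriptor = copy.deepcopy(table_def["StorageDescriptor"])
    descriptor["Location"] = location

    glue_client.create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInput={"Values": list(values), "StorageDescriptor": descriptor},
    )
    return descriptor
```

If the partition is already registered, Glue rejects the call, so you may want to catch that error when the job can be re-run.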