Partitioning table for Amazon Athena

Question

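I am trying to partition the data that Amazon Athena queries by year, month, and day. However, when I query the partitioned data, I get no records back. I followed the instructions in the blog post.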

Table creation query:

CREATE external TABLE mvc_test2 (
ROLE struct<Scope: string, Id: string>,
ACCOUNT struct<ClientId: string, Id: string, Name: string>,
USER struct<Id: string, Name: string>,
IsAuthenticated INT,
Device struct<IpAddress: string>,
Duration double,
Id string,
ResultMessage string,
Application struct<Version: string, Build: string, Name: string>,
Timestamp string,
ResultCode INT
)
Partitioned by(year string, month string, day string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://firehose-titlesdesk-logs/Mvc/'

The table was created successfully, and the result showed this message:

"Query successful. If your table has partitions, you need to load these partitions to be able to query data. You can either load all partitions or load them individually. If you use the load all partitions (MSCK REPAIR TABLE) command, partitions must be in a format understood by Hive. Learn more."

Running

msck repair table mvc_test2;

I get the result:

"Partitions not in metastore: mvc_test2:2017/06/06/21 mvc_test2:2017/06/06/22"

At this point, I tried querying the table and got no results.

The logs are stored in subfolders in year/month/day/hour format. For example: 's3://firehose-application-logs/process/year/month/day/hour'

How do I partition the data correctly?

Answer 1

Your directory format appears to be 2017/06/06/22. This is not compatible with the Hive partition naming convention, which expects year=2017/month=06/day=06/hour=22.

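As a result, your current data layout prevents you from using MSCK REPAIR TABLE to load the partitions automatically. You need to either rename the directories or (preferably) process your data through Hive so that it is stored in the correct format.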
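One way to rewrite the data into a Hive-compatible layout without leaving Athena is a CTAS query. The following is only a sketch under several assumptions: `mvc_raw` is a hypothetical unpartitioned table defined over the same location with the same columns and SerDe as `mvc_test2`, `mvc_partitioned` and the `Mvc-partitioned/` output prefix are made-up names, and the output prefix must be empty before the query runs.

```sql
-- Assumes a hypothetical unpartitioned table mvc_raw over s3://firehose-titlesdesk-logs/Mvc/
-- (same columns and SerDe as mvc_test2, but without the PARTITIONED BY clause).
CREATE TABLE mvc_partitioned
WITH (
  format = 'JSON',
  -- hypothetical, empty output prefix
  external_location = 's3://firehose-titlesdesk-logs/Mvc-partitioned/',
  partitioned_by = ARRAY['year', 'month', 'day']
) AS
SELECT
  *,
  -- "$path" is the S3 object path of each row; partition columns must come last
  regexp_extract("$path", '/Mvc/(\d{4})/(\d{2})/(\d{2})/', 1) AS year,
  regexp_extract("$path", '/Mvc/(\d{4})/(\d{2})/(\d{2})/', 2) AS month,
  regexp_extract("$path", '/Mvc/(\d{4})/(\d{2})/(\d{2})/', 3) AS day
FROM mvc_raw;
```

The output is written under year=.../month=.../day=... folders, which MSCK REPAIR TABLE understands. A single CTAS query can create at most 100 partitions, so a longer date range would need a WHERE filter and several runs.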

See also: Analyzing Data in S3 using Amazon Athena

Answer 2

Add each partition by date. This approach is faster and saves you money, because you load only the partitions you need rather than all of them.

-- LOCATION should point at this day's folder, not at the table root
ALTER TABLE mvc_test2 
ADD PARTITION (year='2017',month='06',day='06')
location 's3://firehose-titlesdesk-logs/Mvc/2017/06/06/'

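You can load more partitions as needed by changing the year, month, and/or day; just make sure they are valid. You can then check that your partitions have been loaded by running this query: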

show partitions mvc_test2
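Several partitions can also be registered in one statement, each with its own LOCATION pointing at that day's folder. A minimal sketch, assuming the logs sit under Mvc/year/month/day/ as the MSCK output in the question suggests; the 2017/06/07 folder is hypothetical and only illustrates the pattern.

```sql
-- Register two days at once; IF NOT EXISTS makes the statement safe to re-run.
ALTER TABLE mvc_test2 ADD IF NOT EXISTS
  PARTITION (year='2017', month='06', day='06')
    LOCATION 's3://firehose-titlesdesk-logs/Mvc/2017/06/06/'
  PARTITION (year='2017', month='06', day='07')  -- hypothetical second day
    LOCATION 's3://firehose-titlesdesk-logs/Mvc/2017/06/07/';
```

Once the partitions are registered, filtering on the partition columns restricts the scan to those folders, for example (run as a separate query):

```sql
SELECT count(*) AS records
FROM mvc_test2
WHERE year = '2017' AND month = '06' AND day = '06';
```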

Answer 3

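AWS now supports Athena partition projection, which automates partition management: partition values are computed from table properties, so new partitions become queryable automatically as new data is added.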

https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html#create-cloudtrail-table-partition-projection
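For the table in the question, partition projection might be enabled roughly as follows. This is a sketch, not a tested configuration: the year range upper bound of 2030 is an arbitrary assumption, and the location template assumes the Mvc/year/month/day/ folder layout described in the question.

```sql
ALTER TABLE mvc_test2 SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  -- year: assumed range, adjust to the data actually present
  'projection.year.type' = 'integer',
  'projection.year.range' = '2017,2030',
  -- month and day are zero-padded to two digits to match folders such as 06
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  -- maps partition values onto the existing non-Hive folder layout
  'storage.location.template' = 's3://firehose-titlesdesk-logs/Mvc/${year}/${month}/${day}/'
);
```

With projection enabled, Athena derives partition locations from the template at query time, so neither MSCK REPAIR TABLE nor ALTER TABLE ADD PARTITION should be needed for this table.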
