Partitions for VPC flow logs

Question

This query works as expected.

CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs3 (
  version int,
  account string,
  interfaceid string,
  sourceaddress string,
  destinationaddress string,
  sourceport int,
  destinationport int,
  protocol int,
  numpackets int,
  numbytes bigint,
  starttime int,
  endtime int,
  action string,
  logstatus string
)  
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION 's3://todel162/AWSLogs/XXXXX/vpcflowlogs/us-east-1/'
TBLPROPERTIES ("skip.header.line.count"="1");

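However, if I add the PARTITIONED BY clause as the documentation suggests, queries against the table return no rows. (The table itself is created successfully.)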

https://docs.aws.amazon.com/athena/latest/ug/vpc-flow-logs.html

In other words, I cannot get partitioning to work by adding this clause to the CREATE TABLE statement:

PARTITIONED BY (dt string)

How do I create a partitioned table for VPC flow logs?

Answer 1

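After creating a partitioned table, you also need to add partitions to it. For a partitioned table, the LOCATION property does not point to the table's data; Athena reads data only from the locations of the partitions that have been added. A newly created partitioned table is therefore essentially empty.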

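There are many ways to add partitions to a partitioned table. VPC flow logs do not follow the Hive partitioning scheme, which means you cannot use MSCK REPAIR TABLE to load all partitions. Instead, you have to list all partitions manually and add them, either with Glue's BatchCreatePartition API call or by running ALTER TABLE vpc_flow_logs3 ADD PARTITION … in Athena. You can find an example of how to do this for flow logs in step 4 of the guide you linked to.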
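
As a rough sketch of the ALTER TABLE route, adding a single day's partition could look like the statements below. The date 2019-01-01 is a made-up example, and the YYYY/MM/DD subfolder appended to the LOCATION from the question is an assumption about how the log files are laid out under that prefix, so check the actual object keys in the bucket before using it. Run the two statements as separate queries.

```sql
-- Register one partition; dt comes from PARTITIONED BY (dt string).
-- The date and the 2019/01/01/ subfolder are illustrative assumptions.
ALTER TABLE vpc_flow_logs3 ADD IF NOT EXISTS
  PARTITION (dt = '2019-01-01')
  LOCATION 's3://todel162/AWSLogs/XXXXX/vpcflowlogs/us-east-1/2019/01/01/';

-- Filter on the partition column so that only that day's files are read.
SELECT *
FROM vpc_flow_logs3
WHERE dt = '2019-01-01'
LIMIT 10;
```

Each additional day needs its own partition, so in practice the ADD PARTITION statement would be repeated, or generated by a script, for every date that has logs.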

amazon-athena