Athena (Hive/Presto): querying a partitioned table with an IN statement

I have the following partitioned table in Athena (Hive/Presto):

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
    id STRING,
    data STRING
)
PARTITIONED BY (
    year string,
    month string,
    day string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3://mybucket';

The data is organized in S3 under a path structure like s3://mybucket/year=2020/month=01/day=30/.

I would like to know whether the following query will take advantage of the partitioning optimization:

SELECT 
  *
FROM 
  mydb.mytable
WHERE 
  (year='2020' AND month='08' AND day IN ('10', '11', '12')) OR 
  (year='2020' AND month='07' AND day IN ('29', '30', '31'));

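My assumption is that, because the IN operator is translated into a series of OR conditions, this is still a query that will benefit from partitioning. Am I right?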

Yes, and this is also mentioned in the documentation:

When Athena runs a query on a partitioned table, it checks to see if any partitioned columns are used in the WHERE clause of the query. If partitioned columns are used, Athena requests the AWS Glue Data Catalog to return the partition specification matching the specified partition columns. The partition specification includes the LOCATION property that tells Athena which Amazon S3 prefix to use when reading data. In this case, only data stored in this prefix is scanned. If you do not use partitioned columns in the WHERE clause, Athena scans all the files that belong to the table's partitions.
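To make the quoted behaviour concrete for the query in the question, here is a minimal Python sketch that lists the S3 prefixes the WHERE clause would narrow the scan to. It only assumes the Hive-style layout and bucket name from the question; the helper names are hypothetical, and in reality Athena resolves each partition's LOCATION through the Glue Data Catalog rather than building paths like this.

```python
BUCKET = "s3://mybucket"  # table LOCATION from the question


def partition_prefix(year, month, day):
    """Build the Hive-style prefix for one (year, month, day) partition."""
    return f"{BUCKET}/year={year}/month={month}/day={day}/"


def matching_prefixes():
    """Enumerate the partitions selected by the two OR branches of the query."""
    # Each tuple mirrors one branch: year = ..., month = ..., day IN (...)
    branches = [
        ("2020", "08", ["10", "11", "12"]),
        ("2020", "07", ["29", "30", "31"]),
    ]
    return [
        partition_prefix(year, month, day)
        for year, month, days in branches
        for day in days
    ]


if __name__ == "__main__":
    for prefix in matching_prefixes():
        print(prefix)
```

If pruning works as the documentation describes, only these six prefixes should be read, no matter how many other partitions the table has.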

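Unfortunately, Athena does not expose information that would make it easier to understand how a query is optimized. At the moment, the only thing you can do is run different variations of the query and look at the statistics returned by the GetQueryExecution API call.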

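One way to tell whether Athena will make use of partitioning in a query is to run the query with different values for the partition columns and check that the amount of data scanned differs. If it does, Athena was able to prune partitions during query planning.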
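The comparison described above could be scripted roughly as follows. This is a sketch, not a definitive implementation: it assumes a boto3 Athena client is passed in as `athena`, the function names are made up for illustration, and the results bucket in the usage note is a placeholder. It relies on the `Statistics.DataScannedInBytes` field of the GetQueryExecution response.

```python
import time


def data_scanned_bytes(athena, sql, database, output_location, poll_seconds=1.0):
    """Run `sql` on Athena and return the number of bytes it scanned."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll GetQueryExecution until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        details = execution["QueryExecution"]
        state = details["Status"]["State"]
        if state == "SUCCEEDED":
            return details["Statistics"]["DataScannedInBytes"]
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"query {query_id} ended in state {state}")
        time.sleep(poll_seconds)


def looks_pruned(athena, narrow_sql, wide_sql, database, output_location,
                 poll_seconds=1.0):
    """Return True if the narrower filter scanned less data than the wider one."""
    narrow = data_scanned_bytes(athena, narrow_sql, database, output_location,
                                poll_seconds)
    wide = data_scanned_bytes(athena, wide_sql, database, output_location,
                              poll_seconds)
    return narrow < wide
```

A possible call, with `s3://my-query-results/` standing in for your own results location: `looks_pruned(boto3.client("athena"), "SELECT * FROM mydb.mytable WHERE year='2020' AND month='08' AND day IN ('10')", "SELECT * FROM mydb.mytable WHERE year='2020' AND month='08' AND day IN ('10', '11', '12')", "mydb", "s3://my-query-results/")`. Equal byte counts do not prove that pruning is absent, for example when the extra partitions are empty, so treat the result as a hint rather than proof.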