AWS Athena - reduce scan size

amazon-athena

How can I reduce the amount of data scanned by a 'select' query in AWS Athena, for example by scanning only one of the columns?

Example: SELECT * FROM table1 WHERE status = 'Fail';

The simplest way to reduce the scan size is to partition the data by the STATUS value.
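
As a rough sketch of that idea, an Athena CTAS (CREATE TABLE AS SELECT) statement can write a copy of the table partitioned by status. The table name table1_by_status, the bucket s3://example-bucket/, and the columns id and message are hypothetical placeholders, not taken from the question; adjust them to the real schema.

```sql
-- Hypothetical names: table1_by_status, example-bucket, id, message.
-- Write a copy of table1 partitioned by status. In a CTAS statement
-- the partition column must come last in the SELECT list, and the
-- external_location prefix is assumed to be empty beforehand.
CREATE TABLE table1_by_status
WITH (
  external_location = 's3://example-bucket/table1_by_status/',
  partitioned_by = ARRAY['status']
) AS
SELECT id, message, status
FROM table1;

-- Filtering on the partition column lets Athena read only the
-- objects stored under the status=Fail partition.
SELECT *
FROM table1_by_status
WHERE status = 'Fail';
```

Partitioning on status should work well here because the column presumably has only a handful of distinct values; a column with very many distinct values would create many small partitions and is usually a poor partition key.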

See the user guide for information about partitioning. However, you may also want to consider a columnar format such as Apache Parquet, a columnar data storage and interchange format that Athena supports.

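Using a columnar format helps because Athena reads only the columns needed to satisfy the query. For a SELECT * query it usually doesn't make much difference, but if you are only interested in a few columns out of tens or hundreds, the I/O savings can be substantial. In addition, Parquet (and ORC, a competing columnar format that Athena also supports) supports compression, so even when all columns are accessed it can still save a lot compared with uncompressed CSV or JSON.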
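
A minimal sketch of the columnar approach, again using CTAS. The names table1_parquet, example-bucket, and the id column are hypothetical, and SNAPPY is just one of the compression codecs Athena accepts for Parquet.

```sql
-- Hypothetical names: table1_parquet, example-bucket, id.
-- Rewrite table1 as compressed Parquet files.
CREATE TABLE table1_parquet
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://example-bucket/table1_parquet/'
) AS
SELECT *
FROM table1;

-- Naming only the needed columns, instead of SELECT *, lets Athena
-- skip the data for every other column in the Parquet files.
SELECT id, status
FROM table1_parquet
WHERE status = 'Fail';
```

The "Data scanned" figure that Athena reports for each query is a simple way to compare this against the original table.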

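See Athena performance tuning tips. This AWS blog post has several tips on reducing the data scanned and improving performance. The main ones I see are:

Compression (varies by file format). See Compression formats and SerDe.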
Partitioning the data.
