Getting Clustering/Bucketing columns programmatically

Question

For reference, I connect to amazon-athena via sqlalchemy, essentially using:

create_engine(
    f'awsathena+rest://:@athena.{myRegion}.amazonaws.com:443/{athena_schema}?s3_staging_dir={myS3_staging_path}',
    echo=True)

In most relational databases that adhere to the ANSI SQL standard, I can get a table's partition columns programmatically with something like the following:

select *
from information_schema.columns
where table_name='myTable' and table_schema='mySchema'
    and extra_info = 'partition key'

However, there doesn't seem to be a similar flag for bucketing or clustering columns. I know I can access this information via:

show create table mySchema.myTable

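But I am interested in a clean programmatic solution, if one exists. I am trying not to reinvent the wheel. Please show me how to do this or point me to the relevant documentation.

Thanks in advance.

PS: It would be great if other information about the table, such as the file location and storage format, were also accessible programmatically.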

Answer 1

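Athena uses the Glue Data Catalog to store metadata about databases and tables. I don't know how much of that is exposed in information_schema, and there is very little documentation about it.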

However, you can get everything Athena knows by querying the Glue Data Catalog directly. In this case, if you call GetTable (e.g. aws glue get-table …), you will find the bucketing information in Table.StorageDescriptor.BucketColumns.

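The GetTable call also gives you the storage format and the location of the files (but for partitioned tables you need an additional call, GetPartitions, to retrieve the location of each partition's data).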
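As a rough illustration of the two Glue calls, here is a minimal Python sketch using boto3. The helper names (make_glue_client, describe_table, partition_locations) and the region are hypothetical and only for illustration; the sketch assumes your credentials allow glue:GetTable and glue:GetPartitions, and that the database and table names match the ones you use in Athena.

```python
def make_glue_client(region_name):
    """Create a Glue client (hypothetical helper; needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the functions below work with any client object
    return boto3.client("glue", region_name=region_name)


def describe_table(glue_client, database, table):
    """Return bucketing, partitioning, location and format info from GetTable."""
    tbl = glue_client.get_table(DatabaseName=database, Name=table)["Table"]
    sd = tbl.get("StorageDescriptor", {})
    return {
        "bucket_columns": sd.get("BucketColumns", []),
        "number_of_buckets": sd.get("NumberOfBuckets"),
        "partition_keys": [k["Name"] for k in tbl.get("PartitionKeys", [])],
        "location": sd.get("Location"),
        "input_format": sd.get("InputFormat"),
        "output_format": sd.get("OutputFormat"),
        "serde": sd.get("SerdeInfo", {}).get("SerializationLibrary"),
    }


def partition_locations(glue_client, database, table):
    """Map each partition's values to its data location via GetPartitions."""
    paginator = glue_client.get_paginator("get_partitions")
    locations = {}
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        for part in page["Partitions"]:
            locations[tuple(part["Values"])] = part["StorageDescriptor"]["Location"]
    return locations
```

With real credentials, something like describe_table(make_glue_client("us-east-1"), "mySchema", "myTable") should return the bucket columns alongside the location and format fields; the region here is just a placeholder.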

