How to CTAS with external location as csv.gz

Question

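I have nearly 90 GB of data that needs to be uploaded to an S3 bucket with a specific naming convention.

If I use a CTAS query with external_location, it gives me no option to name the output files. Also, format csv is not an option.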

CREATE TABLE ctas_csv_partitioned 
WITH (
     format = 'TEXTFILE',  
     external_location = 's3://my_athena_results/ctas_csv_partitioned/', 
     partitioned_by = ARRAY['key1']
) 
AS SELECT name1, address1, comment1, key1
FROM tables1

I want to upload the output file so that it looks like sample_file.csv.gz

What is the easiest way to do this?

Answer 1

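Unfortunately, there is no way to specify a file name or extension with Athena alone. Moreover, files created by a CTAS query have no file extension at all. However, you can rename the files directly with the AWS CLI for S3.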

# $4 is the object key; with --recursive it is printed relative to the bucket root ("path" here)
aws s3 ls s3://path/to/external/location/ --recursive \
| awk '{cmd="aws s3 mv s3://path/"$4 " s3://path/"$4".csv.gz"; system(cmd)}'

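I just tried this snippet and it worked fine. However, sometimes an empty file s3://path/to/external/location/.csv.gz also gets created. Note that I did not add the --recursive option to aws s3 mv, because it also produces weird results.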
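A more defensive variant of the rename step could look like the sketch below. It is not part of the tested snippet: the function name build_mv_commands and the BUCKET variable are hypothetical, and it assumes object keys contain no spaces. It only prints the aws s3 mv commands, so they can be reviewed first, and it skips folder markers and blank keys, which is one plausible source of the stray .csv.gz object mentioned above.

```shell
#!/usr/bin/env bash
# Sketch only: BUCKET and build_mv_commands are illustrative names, not AWS features.
BUCKET="${BUCKET:-my_athena_results}"

# Reads `aws s3 ls --recursive` output on stdin and prints one
# `aws s3 mv` command per object, without executing anything.
build_mv_commands() {
  # Columns: date, time, size, key (the key is relative to the bucket root).
  awk -v bucket="$BUCKET" '
    $4 == "" || $4 ~ /\/$/ { next }   # skip blank keys and folder markers
    $4 ~ /\.csv\.gz$/      { next }   # skip objects that were already renamed
    { printf "aws s3 mv s3://%s/%s s3://%s/%s.csv.gz\n", bucket, $4, bucket, $4 }
  '
}
```

To preview, run aws s3 ls "s3://$BUCKET/ctas_csv_partitioned/" --recursive | build_mv_commands, and pipe the result to sh once the printed commands look right.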

As for the format, just add field_delimiter = ',' to the WITH clause.

CREATE TABLE ctas_csv_partitioned 
WITH (
     format = 'TEXTFILE',
     field_delimiter = ',',
     external_location = 's3://my_athena_results/ctas_csv_partitioned/', 
     partitioned_by = ARRAY['key1']
) 
AS SELECT name1, address1, comment1, key1
FROM tables1

amazon-s3

amazon-web-services

amazon-athena