Specifying columns for AWS Glue crawler from separate file

Question

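I am using a Glue crawler to create a table in Athena for a set of CSV files produced by an external provider. The files have no headers; instead they come with a separate one-line CSV file that specifies the headers. There are over 1000 columns, so manually editing the schema to rename the columns from Glue's default col0, col1, col2 is a last resort. Is there a way to tell Glue/Athena to pick up the column names from a file separate from the data?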

Answer 1

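I was able to do this using boto3 and its update_table method. Most of the solution comes from an existing example that renames a single column. To rename all columns based on an external file, instead of that single-column approach I used the following: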

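import csv

# Read the real column names from the one-line, tab-separated header file.
with open('column_headers.tsv') as cfile:
    creader = csv.reader(cfile, delimiter='\t')
    colnames = next(creader)

# Map Glue's default names (col0, col1, ...) to the real names by position.
old_colnames = [oc['Name'] for oc in old_table['StorageDescriptor']['Columns']]
col_map = dict(zip(old_colnames, colnames))

# new_table is the TableInput copy of old_table; rename its columns in place.
for col in new_table['StorageDescriptor']['Columns']:
    col['Name'] = col_map[col['Name']]

client.update_table(DatabaseName=db_name, TableInput=new_table)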
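
The snippet above does not show where old_table and new_table come from. A minimal sketch of one way to fill that in is below. It assumes old_table is the 'Table' entry returned by the Glue client's get_table call, and that new_table must contain only keys that TableInput accepts. The helper names (read_header, build_table_input, rename_columns) and the TABLE_INPUT_KEYS allow-list are my own for illustration and are not from the original code; check the allow-list against the boto3 documentation before relying on it.

```python
import copy
import csv

# Keys that Glue's TableInput accepts (assumed subset; verify in the boto3 docs).
# get_table also returns read-only keys such as DatabaseName and CreateTime,
# which update_table rejects, so they are filtered out below.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def read_header(path, delimiter="\t"):
    """Return the column names from the first row of the header file."""
    with open(path, newline="") as f:
        return next(csv.reader(f, delimiter=delimiter))


def build_table_input(old_table, colnames):
    """Copy old_table into a TableInput dict with columns renamed by position."""
    columns = old_table["StorageDescriptor"]["Columns"]
    if len(columns) != len(colnames):
        raise ValueError(
            f"table has {len(columns)} columns but header has {len(colnames)}"
        )
    new_table = {
        key: copy.deepcopy(value)
        for key, value in old_table.items()
        if key in TABLE_INPUT_KEYS
    }
    for col, name in zip(new_table["StorageDescriptor"]["Columns"], colnames):
        col["Name"] = name
    return new_table


def rename_columns(client, db_name, table_name, header_path):
    """Fetch the table, rename its columns from the header file, and update it."""
    old_table = client.get_table(DatabaseName=db_name, Name=table_name)["Table"]
    new_table = build_table_input(old_table, read_header(header_path))
    client.update_table(DatabaseName=db_name, TableInput=new_table)
    return new_table
```

With boto3 installed and credentials configured, this would be called as rename_columns(boto3.client('glue'), db_name, table_name, 'column_headers.tsv'). The length check is there because zip silently stops at the shorter list, which would otherwise hide a mismatch between the header file and the crawled schema.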

amazon-athena

aws-glue