Is it possible to re-partition the data using an AWS Glue crawler?

Question

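I inherited an S3 bucket from a former colleague. The files in it are partitioned by id and by time, for example:

s3://bucket/partition_id=0/year=2017/month=6/day=1/file

The data in all of these files forms a single table that can be queried through Athena. The Glue catalog also shows that partition (0) is the id, partition (1) is the year, and so on.

I recently wanted to restructure this work, and partitioning by id does not seem very intuitive to me. I tried pointing a Glue crawler at the S3 bucket, but I could not find an option to make it partition by time only, without the id, like this:

s3://bucket/year=2017/month=6/day=1/file

I'm still new to AWS and not sure whether this is feasible, or whether it even makes sense to you. Please give me some feedback. Thanks.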

Answer 1

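I don't think you can do this with a crawler, but you can create the new table manually in Athena with a CTAS (CREATE TABLE AS SELECT) query like the one below. The old partition_id partition becomes an ordinary column in the new table. (See also https://docs.aws.amazon.com/en_us/athena/latest/ug/ctas-examples.html)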

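-- The columns named in partitioned_by must be the last columns of the SELECT list,
-- in the same order. SELECT * works here because Athena returns partition columns last.
CREATE TABLE new_table
WITH (
    format = 'ORC',
    external_location = 's3://...',
    partitioned_by = ARRAY['year', 'month', 'day']
) AS
SELECT *
FROM old_table;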
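
One practical caveat: a single Athena CTAS or INSERT INTO query can write at most 100 partitions, so a table with more than about three months of daily partitions will probably need to be copied in batches. The sketch below generates one statement per month: a CTAS for the first month, then INSERT INTO for the rest. The function name `build_statements` and its parameters are hypothetical, and it assumes the `year` and `month` partition columns are stored as strings; drop the quotes in the WHERE clause if your catalog lists them as integers.

```python
def build_statements(months, external_location,
                     old_table="old_table", new_table="new_table"):
    """Return one SQL string per (year, month) pair.

    months: list of (year, month) tuples, e.g. [(2017, 6), (2017, 7)].
    external_location: S3 prefix for the new table (must be empty).
    Each statement touches at most 31 day-partitions, staying under
    Athena's limit of 100 partitions written per query.
    """
    statements = []
    for i, (year, month) in enumerate(months):
        # Assumption: partition values are strings, as in 'year=2017/month=6'.
        where = f"WHERE year = '{year}' AND month = '{month}'"
        if i == 0:
            # The first batch creates the table and its storage layout.
            sql = (
                f"CREATE TABLE {new_table}\n"
                "WITH (\n"
                "    format = 'ORC',\n"
                f"    external_location = '{external_location}',\n"
                "    partitioned_by = ARRAY['year', 'month', 'day']\n"
                ") AS\n"
                f"SELECT * FROM {old_table}\n"
                f"{where}"
            )
        else:
            # Later batches append new partitions to the existing table.
            sql = (
                f"INSERT INTO {new_table}\n"
                f"SELECT * FROM {old_table}\n"
                f"{where}"
            )
        statements.append(sql)
    return statements
```

The statements can be pasted into the Athena console one at a time, or submitted in order with boto3's Athena `start_query_execution`, waiting for each query to finish before starting the next.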

Answer 2

Write a Python shell job that uses the boto S3 API to reorganize the folder structure, then run the crawler.
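
A minimal sketch of such a job, assuming the key layout from the question (`partition_id=<id>/year=.../month=.../day=.../file`). The helper names `new_key_for` and `repartition` are hypothetical. Since dropping the `partition_id` folder could make files from different ids collide, the sketch prefixes each file name with the old id; that naming scheme is an assumption, not part of the original answer. Note that the id is no longer available as a column afterwards unless it is also stored inside the files.

```python
import re

# Matches the leading "partition_id=<value>/" segment of a key.
_PARTITION_ID = re.compile(r"^partition_id=([^/]+)/")


def new_key_for(key):
    """Return the key without its partition_id segment, or None if it has none."""
    match = _PARTITION_ID.match(key)
    if match is None:
        return None
    old_id = match.group(1)
    rest = key[match.end():]
    head, _, filename = rest.rpartition("/")
    # Keep the old id in the file name so files from different ids do not collide.
    new_name = f"id{old_id}_{filename}"
    return f"{head}/{new_name}" if head else new_name


def repartition(s3, bucket, dry_run=True):
    """Copy every object under partition_id=*/ to its new key.

    s3 is a boto3 S3 client, e.g. boto3.client("s3").
    With dry_run=True nothing is changed; the planned moves are only returned.
    """
    moves = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix="partition_id="):
        for obj in page.get("Contents", []):
            old_key = obj["Key"]
            if old_key.endswith("/"):
                continue  # skip folder placeholder objects
            new_key = new_key_for(old_key)
            if new_key is None:
                continue
            moves.append((old_key, new_key))
            if not dry_run:
                # copy_object handles objects up to 5 GB; larger ones need a multipart copy.
                s3.copy_object(
                    Bucket=bucket,
                    Key=new_key,
                    CopySource={"Bucket": bucket, "Key": old_key},
                )
                s3.delete_object(Bucket=bucket, Key=old_key)
    return moves
```

It is probably worth running it once with `dry_run=True` and checking the returned list before calling `repartition(boto3.client("s3"), "bucket", dry_run=False)`. After the move, run the crawler again so the catalog picks up the new year/month/day partitions.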

amazon-s3

amazon-web-services

partition

aws-glue