How to load only the 365 most recent files into Hadoop/Hive?

I created a table:

CREATE EXTERNAL TABLE events (
  id bigint,
  received_at string,
  generated_at string,
  source_id int,
  source_name string,
  source_ip string,
  facility string,
  severity string,
  program string,
  message string
)
PARTITIONED BY (
  dt string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://mybucket/folder1/folder2';

Inside s3://mybucket/folder1/folder2 there are multiple folders named in the format dt=YYYY-MM-DD/, and each folder contains a single file named in the format YYYY-MM-DD.tsv.gz.

I then load the table with MSCK REPAIR TABLE events;. When I run SELECT * FROM events LIMIT 5; I get:

OK
Failed with exception java.io.IOException:com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 66C6392F74DBED77), S3 Extended Request ID: YPL1P4BO...+fxF+Me//cp7Fxpiuqxds2ven9/4DEc211JI2Q7BLkc=
Time taken: 0.823 seconds

This is because objects older than 365 days have been moved to Glacier.

How can I programmatically load only the 365 most recent files, or better still, specify that only files newer than (or named after) a given date should be loaded?

PS: I only spin up the Hadoop/Hive cluster when I need it, and it always starts from scratch with no previous data in it, so the concern is only adding data, never deleting it.

You need to keep Hive from seeing the Glacier-backed partitions by explicitly adding only the S3-backed ones. After creating the table, add a partition for each of the 365 dates, like this:

CREATE EXTERNAL TABLE ...;
ALTER TABLE events ADD PARTITION (dt = '2015-01-01');
ALTER TABLE events ADD PARTITION (dt = '2015-01-02');
ALTER TABLE events ADD PARTITION (dt = '2015-01-03');
...
ALTER TABLE events ADD PARTITION (dt = '2015-12-31');
SELECT * FROM events LIMIT 5;
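
To do this programmatically rather than typing 365 statements by hand, the ADD PARTITION statements can be generated with a small script. The following is only a minimal sketch, not part of the original answer: the output file name add_partitions.hql, the IF NOT EXISTS clause, and the choice of "today minus 364 days" as the cut-off are all assumptions you may want to adjust.

#!/usr/bin/env python
# Minimal sketch (assumptions noted above): emit ALTER TABLE statements for
# the last 365 daily partitions and write them to a file runnable with hive -f.
from datetime import date, timedelta

N_DAYS = 365                              # how many recent daily partitions to register
end = date.today()
start = end - timedelta(days=N_DAYS - 1)  # or hard-code a fixed cut-off date here

statements = []
d = start
while d <= end:
    statements.append(
        "ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt = '%s');" % d.isoformat()
    )
    d += timedelta(days=1)

# Write the statements to a script file for Hive to execute in one pass.
with open("add_partitions.hql", "w") as f:
    f.write("\n".join(statements) + "\n")

Run the CREATE EXTERNAL TABLE statement first, then execute the generated file with hive -f add_partitions.hql. Because only S3-backed partitions are registered, the SELECT no longer touches any object that has been moved to Glacier.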