Presto: How to read from s3 an entire bucket that is partitioned in sub-folders?

Question

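I need to use Presto to read an entire dataset from s3 that sits in "bucket-a". Inside the bucket, however, the data is saved in sub-folders by year, so the bucket looks like this:

bucket-a > 2017 > data

bucket-a > 2018 > more data

bucket-a > 2019 > more data

All of the data above belongs to the same table; it is just stored this way in s3. Note that bucket-a itself contains no data, only the folders do.

What I want to do is read all the data in the bucket as a single table, with the year added as a column or a partition.

I tried the following, but it did not work: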

CREATE TABLE hive.default.mytable (
  col1 int,
  col2 varchar,
  year int
)
WITH (
  format = 'json',
  partitioned_by = ARRAY['year'],
  external_location = 's3://bucket-a/' -- also tried 's3://bucket-a/year/'
)

And also:

CREATE TABLE hive.default.mytable (
  col1 int,
  col2 varchar,
  year int
)
WITH (
  format = 'json',
  bucketed_by = ARRAY['year'],
  bucket_count = 3,
  external_location = 's3://bucket-a/' -- also tried 's3://bucket-a/year/'
)

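Neither of the above worked.

I have seen people write partitioned data to s3 with Presto, but what I am trying to do is the opposite: read data from s3 that is already split into folders, as a single table.

Thanks.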

Answer 1

If your folders followed the Hive partition folder naming convention (year=2019/), you could declare the table as partitioned and simply use the system.sync_partition_metadata procedure in Presto.
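As a rough sketch of that first case, assuming the folders were renamed to year=2017/, year=2018/ and year=2019/, and reusing the catalog, schema, table and column names from the question (the 'FULL' mode is just one choice; 'ADD' and 'DROP' also exist), it might look like this:

```sql
-- Assumes the layout s3://bucket-a/year=2017/, s3://bucket-a/year=2018/, ...
-- The partition column must come last in the column list.
CREATE TABLE hive.default.mytable (
  col1 int,
  col2 varchar,
  year int
)
WITH (
  format = 'JSON',
  partitioned_by = ARRAY['year'],
  external_location = 's3://bucket-a/'
);

-- Scan the table location and add to the metastore any partitions found on s3.
CALL hive.system.sync_partition_metadata('default', 'mytable', 'FULL');

-- The year partitions should then be queryable as an ordinary column.
SELECT year, count(*) FROM hive.default.mytable GROUP BY year;
```

Creating the table alone does not make the partitions visible; the metastore only learns about them once the sync (or an explicit registration) has run.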

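As it stands, your folders do not follow the convention, so you need to register each folder individually as a partition using the system.register_partition procedure (it will be available in Presto 330, which is about to be released). (An alternative to register_partition is to run the appropriate ADD PARTITION statement in the Hive CLI.)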
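A minimal sketch of the per-folder registration, assuming Presto 330 or later, a partitioned table hive.default.mytable declared as in the question, and the folder layout s3://bucket-a/2017/ and so on (the trailing slashes and the exact paths are assumptions based on the question). Depending on the version, the procedure may also need to be enabled in the Hive catalog configuration.

```sql
-- Arguments: schema, table, partition columns, partition values, location.
-- Partition values are passed as strings even though year is declared as int.
CALL hive.system.register_partition('default', 'mytable', ARRAY['year'], ARRAY['2017'], 's3://bucket-a/2017/');
CALL hive.system.register_partition('default', 'mytable', ARRAY['year'], ARRAY['2018'], 's3://bucket-a/2018/');
CALL hive.system.register_partition('default', 'mytable', ARRAY['year'], ARRAY['2019'], 's3://bucket-a/2019/');

-- Alternative mentioned above, to be run in the Hive CLI rather than in Presto:
-- ALTER TABLE mytable ADD PARTITION (year=2017) LOCATION 's3://bucket-a/2017/';
-- ALTER TABLE mytable ADD PARTITION (year=2018) LOCATION 's3://bucket-a/2018/';
-- ALTER TABLE mytable ADD PARTITION (year=2019) LOCATION 's3://bucket-a/2019/';
```

Each new year folder would need its own registration call, which is the cost of keeping a layout that does not follow the year=... naming convention.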

