How to change column names of autodetected partitions created by Glue Crawler?

Question

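I have a bucket that is used as the destination of a Kinesis Firehose stream.

Firehose automatically creates date-based prefixes in that bucket using the yyyy/MM/dd/HH format.

I then created a crawler that searches for data in this bucket and configured it as follows:

[crawler configuration screenshot]

After the crawler runs, it creates a table with the following schema: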

| #   | Column name   | Data type | Key           |
| --- | -----------   | --------- | ------------- |
| 1   | numberissues  | int       |               |
| 2   | group         | string    |               |
| 3   | createdat     | string    |               |
| 4   | companyunitid | string    |               |
| 5   | partition_0   | string    | Partition (0) |
| 6   | partition_1   | string    | Partition (1) |
| 7   | partition_2   | string    | Partition (2) |
| 8   | partition_3   | string    | Partition (3) |

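If I rename the partition_* columns to the proper year, month, day and hour, the table is ready for me to use.

However, if the crawler runs again, the schema reverts the column names to the original partition_*.

I know this works with the Hive partitioning pattern year=2018/month=04..., but I would like to know whether it is possible to "hint" Glue with the partition field names.

Another option would be to change the Firehose prefix, but I could not find anything indicating that this is even possible.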

Answer 1

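In this case, you can set the "Ignore the change and don't update the data catalog" option on the crawler.

Then you can rename the columns. The crawler will still detect new partitions on its next run, but it will keep the renamed column names.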
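If the crawler is managed through the API rather than the console, the same setting can probably be expressed as a schema change policy. The following is a minimal sketch using boto3's `update_crawler`. It assumes that the console option above corresponds to `UpdateBehavior="LOG"`, and the crawler name in the usage note is hypothetical.

```python
def schema_change_policy_ignore_updates():
    """Build a SchemaChangePolicy that logs schema changes instead of applying them."""
    # Assumption: "LOG" is the API counterpart of the console option
    # "Ignore the change and don't update the data catalog".
    return {"UpdateBehavior": "LOG"}


def apply_policy(crawler_name):
    """Apply the policy to an existing crawler (requires boto3 and AWS credentials)."""
    # boto3 is imported here so the policy builder above works without it installed.
    import boto3

    glue = boto3.client("glue")
    glue.update_crawler(
        Name=crawler_name,
        SchemaChangePolicy=schema_change_policy_ignore_updates(),
    )
```

Calling `apply_policy("firehose-events-crawler")` (a made-up name) once before renaming the columns should be enough, since the policy is stored with the crawler.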

Answer 2

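It is now possible to specify a custom format for the S3 prefix that Firehose writes to. To conform to the Hive partitioning style, you can use this syntax in the prefix: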

beginning_of_prefix/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/

Example output:

beginning_of_prefix/year=2021/month=09/day=03/hour=16/

This lets your Glue crawler recognize the partition names.
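To check a prefix before deploying it, the expansion can be approximated locally. This is only a sketch, not Firehose's implementation: the `expand_prefix` helper is hypothetical, and it maps just the four Java pattern tokens used above (yyyy, MM, dd, HH) to their `strftime` equivalents.

```python
import re
from datetime import datetime

# Only the pattern tokens used in the prefix above are mapped; the real
# Java DateTimeFormatter supports many more.
_JAVA_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}

# Matches expressions of the form !{timestamp:<pattern>}
_TIMESTAMP_EXPR = re.compile(r"!\{timestamp:([^}]*)\}")


def expand_prefix(prefix, arrival_time):
    """Approximate how a timestamp-namespace prefix expands for one timestamp."""
    def render(match):
        pattern = match.group(1)
        for java_token, strftime_token in _JAVA_TO_STRFTIME.items():
            pattern = pattern.replace(java_token, strftime_token)
        return arrival_time.strftime(pattern)

    return _TIMESTAMP_EXPR.sub(render, prefix)
```

For example, passing the prefix above together with `datetime(2021, 9, 3, 16)` reproduces the example output line.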

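In more detail, the !{namespace:value} syntax introduced by AWS gives access to the timestamp that Firehose uses for partitioning and prints it into the prefix. This is done by specifying timestamp as the namespace and a valid Java DateTimeFormatter string as the value. Note that: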

When evaluating timestamps, Kinesis Data Firehose uses the approximate arrival timestamp of the oldest record that's contained in the Amazon S3 object being written.

And also:

If you specify a prefix that doesn't contain a timestamp namespace expression, Kinesis Data Firehose appends the expression !{timestamp:yyyy/MM/dd/HH/} to the value in the Prefix field.

(So if you don't use the timestamp namespace, the old partitioning scheme is used.)
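The rule quoted above can be written down as a small sketch. The `effective_prefix` helper is hypothetical and only mirrors the documented behavior as quoted; it is not code taken from Firehose.

```python
# Expression that, per the quoted documentation, is appended when the
# prefix contains no timestamp namespace expression.
DEFAULT_TIMESTAMP_SUFFIX = "!{timestamp:yyyy/MM/dd/HH/}"


def effective_prefix(prefix):
    """Return the prefix as the quoted rule says it will be evaluated."""
    if "!{timestamp:" in prefix:
        # A timestamp expression is present, so the prefix is used as written.
        return prefix
    # Otherwise the old yyyy/MM/dd/HH layout is appended.
    return prefix + DEFAULT_TIMESTAMP_SUFFIX
```

A prefix such as `events/` would therefore still end up with unnamed date folders, which is the layout that makes the crawler fall back to the partition_* column names.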

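There are other namespaces as well, for example one that lets you partition by Firehose error type in the error output prefix (!{firehose:error-output-type}).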


amazon-athena

aws-glue

amazon-kinesis-firehose