How to read AWS Glue Data Catalog table schemas programmatically

Question

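I have a set of daily CSV files with a uniform structure that I will upload to S3. A downstream job loads the CSV data into a Redshift database table. The number of columns in the CSV may increase, and from then on the new files will contain the new columns. When that happens, I would like to detect the change and automatically add the columns to the target Redshift table.

My plan is to run a Glue Crawler on the source CSV files. Any change in the schema would generate a new version of the table in the Glue Data Catalog. I would then like to programmatically read the table structure (columns and their data types) of the latest version of the table in the Glue Data Catalog using Java, .NET, or another language, and compare it with the schema of the Redshift table. If new columns are found, I will generate a DDL statement to alter the Redshift table and add the columns.

Can anyone point me to examples of reading Glue Data Catalog tables using Java, .NET, or other languages? Are there better ways to automatically add new columns to Redshift tables?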

Answer 1

If you want to use Java, add this dependency:

<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk-glue</artifactId>
  <version>{VERSION}</version>
</dependency>

Here is a code snippet that gets the table versions and the list of columns:

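import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClientBuilder;
import com.amazonaws.services.glue.model.Column;
import com.amazonaws.services.glue.model.GetTableVersionsRequest;
import com.amazonaws.services.glue.model.GetTableVersionsResult;
import com.amazonaws.services.glue.model.TableVersion;
import java.util.List;

AWSGlue client = AWSGlueClientBuilder.defaultClient();
GetTableVersionsRequest tableVersionsRequest = new GetTableVersionsRequest()
    .withDatabaseName("glue_catalog_database_name")
    .withTableName("table_name_generated_by_crawler"); // the table name, not the catalog ID
GetTableVersionsResult results = client.getTableVersions(tableVersionsRequest);
// Here you have the table versions; at this point you can check for new ones
List<TableVersion> versions = results.getTableVersions();
// Columns of the first returned version; compare getVersionId() values if you need to be sure it is the latest
List<Column> tableColumns = versions.get(0).getTable().getStorageDescriptor().getColumns();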

See the AWS documentation for the TableVersion and StorageDescriptor objects.

You can also use the boto3 library for Python.

Hope this helps.
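
Once you have the column list, the comparison step described in the question could look roughly like the sketch below. It is not taken from any AWS SDK: the class `SchemaDiff`, its methods, and the Glue-to-Redshift type mapping are all hypothetical and only illustrate the idea. It assumes you have already copied each Glue column's `getName()` and `getType()` into a map, and that you have fetched the existing Redshift column names yourself (for example through JDBC).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the AWS SDK.
class SchemaDiff {

    // Assumed mapping from Glue (Hive-style) type names to Redshift types; adjust to your data.
    static String toRedshiftType(String glueType) {
        String t = glueType.toLowerCase(Locale.ROOT);
        if (t.startsWith("decimal")) {
            return t.toUpperCase(Locale.ROOT); // e.g. decimal(10,2) -> DECIMAL(10,2)
        }
        switch (t) {
            case "tinyint":
            case "smallint":  return "SMALLINT";
            case "int":       return "INTEGER";
            case "bigint":    return "BIGINT";
            case "float":     return "REAL";
            case "double":    return "DOUBLE PRECISION";
            case "boolean":   return "BOOLEAN";
            case "date":      return "DATE";
            case "timestamp": return "TIMESTAMP";
            default:          return "VARCHAR(256)"; // fallback for string and unrecognized types
        }
    }

    // Returns one ALTER TABLE statement per Glue column that is missing in Redshift.
    // Names are compared case-insensitively and emitted in lower case.
    static List<String> addColumnStatements(String redshiftTable,
                                            Map<String, String> glueColumns,
                                            Collection<String> redshiftColumns) {
        Set<String> existing = new HashSet<>();
        for (String name : redshiftColumns) {
            existing.add(name.toLowerCase(Locale.ROOT));
        }
        List<String> statements = new ArrayList<>();
        for (Map.Entry<String, String> column : glueColumns.entrySet()) {
            String name = column.getKey().toLowerCase(Locale.ROOT);
            if (!existing.contains(name)) {
                String quoted = "\"" + name.replace("\"", "\"\"") + "\"";
                statements.add("ALTER TABLE " + redshiftTable + " ADD COLUMN "
                        + quoted + " " + toRedshiftType(column.getValue()) + ";");
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        Map<String, String> glue = new LinkedHashMap<>(); // column name -> Glue type
        glue.put("id", "bigint");
        glue.put("name", "string");
        glue.put("signup_date", "date");
        List<String> ddl = addColumnStatements("public.customers", glue, Arrays.asList("id", "name"));
        ddl.forEach(System.out::println);
    }
}
```

With the sample data in `main`, only `signup_date` is missing from the Redshift side, so a single statement is generated. One statement is emitted per column so that each can be run and logged separately; the table name is inserted as given, so pass only a trusted, already-qualified name.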

amazon-redshift

aws-glue