Duplicate Table in AWS Glue using AWS Athena

Question

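I have a table in AWS Glue which uses an S3 bucket as its data location. I want to run an Athena query on that existing table and use the query results to create a new Glue table.

I have tried creating a new Glue table, pointing it to a new location in S3, and piping the Athena query results to that S3 location. This almost accomplishes what I want, but:

A .csv.metadata file is written to this location along with the actual .csv output (and it is read by the Glue table, since the table reads all files in the specified S3 location).
The csv file places double quotes around every field, which breaks any column that the Glue table defines as numeric.

These services are all designed to work together, so there must be a proper way to accomplish this. Any advice would be much appreciated :)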

Answer 1

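I guess you have to change your SerDe. If you are querying CSV data, either OpenCSVSerde or LazySimpleSerDe should work for you. For fields wrapped in double quotes, OpenCSVSerde is the one that handles the quoting.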
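
As a rough illustration of that suggestion, here is a minimal sketch of a table definition that uses OpenCSVSerde to strip the double quotes. The table name `athena_results_csv` and the columns `id` and `amount` are hypothetical, and the location simply reuses the placeholder bucket from this thread. It only addresses the quoting issue, not the extra .csv.metadata file.

```sql
-- Hypothetical table and column names; adjust to your own schema.
CREATE EXTERNAL TABLE athena_results_csv (
    id     string,
    amount string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = ',',
    'quoteChar'     = '"',
    'escapeChar'    = '\\'
)
LOCATION 's3://my_athena_results/new_table_files/'
-- Athena result files start with a header row, so skip it.
TBLPROPERTIES ('skip.header.line.count' = '1');
```

With the columns declared as string, numeric values can still be recovered at query time, for example with `CAST(amount AS double)`.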

Answer 2

The way to do this is to use CTAS query statements.

A CREATE TABLE AS SELECT (CTAS) query creates a new table in Athena from the results of a SELECT statement from another query. Athena stores data files created by the CTAS statement in a specified location in Amazon S3.

For example:

CREATE TABLE new_table
WITH (
    external_location = 's3://my_athena_results/new_table_files/'
) AS (
    -- Here goes your normal query
    SELECT
        *
    FROM
        old_table
);

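There are some limitations, though. For your case, the most important ones are:

The destination location for storing CTAS query results in Amazon S3 must be empty.
The same applies to the name of the new table, i.e. it must not already exist in the AWS Glue Data Catalog.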
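
If the CTAS statement has to be rerun under the same name, one way to satisfy both conditions is sketched below, assuming the previous `new_table` is safe to discard. Dropping the table removes only the catalog entry. The files under the external location stay in S3 and have to be deleted separately (for instance from the S3 console) before the CTAS runs again.

```sql
-- Athena runs one statement per query, so submit these separately.

-- Remove the existing catalog entry, if any; the S3 files are left in place.
DROP TABLE IF EXISTS new_table;

-- After emptying s3://my_athena_results/new_table_files/ by other means,
-- the CTAS statement can be run again.
CREATE TABLE new_table
WITH (
    external_location = 's3://my_athena_results/new_table_files/'
) AS
SELECT *
FROM old_table;
```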

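In general, since Athena is a distributed system, you don't have explicit control over how many files will be created as a result of a CTAS query. However, you can try this workaround, which uses the bucketed_by and bucket_count fields in the WITH clause: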

CREATE TABLE new_table
WITH (
    external_location = 's3://my_athena_results/new_table_files/',
    bucketed_by = ARRAY['some_column_from_select'],
    bucket_count = 1
) AS (
    -- Here goes your normal query
    SELECT
        *
    FROM
        old_table
);

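Apart from creating new files and defining a table associated with them, you can also convert your data to a different file format, e.g. Parquet, JSON, etc.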
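
A minimal sketch of such a conversion, assuming the same `old_table` as above and a hypothetical, empty target prefix `new_table_parquet_files/`, could look like this:

```sql
-- The table name and S3 prefix are placeholders for illustration.
CREATE TABLE new_table_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my_athena_results/new_table_parquet_files/'
) AS
SELECT *
FROM old_table;
```

Because Parquet stores typed columns, this also sidesteps the quoted-number problem from the question.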
