Trying to include custom SQL in an AWS Glue job. I suspect I need to replace

Question

I need to include a custom SQL statement in an AWS Glue job.

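I suspect I need to change the line datasource0 = glueContext.create_dynamic_frame.from_catalog, replacing the ".from_catalog" method with "create_dynamic_frame_from_rdd", but I am not sure how to implement that. I could not find any examples online.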

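What I am ultimately trying to achieve: I have an empty partitioned Athena table. I plan to load it with data from another partitioned Athena table. The only difference is that my target table has 2 additional columns. I do have a custom SQL query that selects the data from the source table and includes the 2 extra new columns, but I am not sure how it fits into an AWS Glue job. Can anyone help? Thanks.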

Answer 1

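1) Create a dynamic frame from the catalog (using the source Athena table).

2) Convert the dynamic frame to a DataFrame.

3) Register the DataFrame as a temp table in Spark.

4) Execute your SQL query on this temp table.

5) Write the data back to S3.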

Sample code:

spark = glueContext.spark_session  # Spark session used to run the SQL query
DyF = glueContext.create_dynamic_frame.from_catalog(database="{{database}}", table_name="{{table_name}}")
df = DyF.toDF()  # convert the DynamicFrame to a Spark DataFrame
df.createOrReplaceTempView('{{name}}')  # replaces the deprecated registerTempTable
df = spark.sql('{{your select query with table name that you used for temp table above}}')
df.write.format('{{orc/parquet/whatever}}').partitionBy("{{columns}}").save('path to s3 location')
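As a slightly fuller sketch of the same steps, the snippet below wraps them in a function and builds a query that keeps every source column and appends the extra ones. The helper names (build_select_sql, run_job), the view name source_view, the Parquet output format and the "append" write mode are all assumptions made for illustration; they are not part of the original question or of the Glue API. Substitute your own SQL if it does more than add columns.

```python
def build_select_sql(view_name, extra_columns):
    """Build a SELECT that keeps all source columns and appends extra ones.

    extra_columns maps each new column name to a Spark SQL expression.
    Both the names and the expressions are hypothetical placeholders.
    """
    extras = ", ".join(f"{expr} AS {name}" for name, expr in extra_columns.items())
    return f"SELECT *, {extras} FROM {view_name}"


def run_job(glue_context, database, table_name, output_path, partition_cols, extra_columns):
    """Read a catalog table, run custom SQL on it, and write partitioned output."""
    spark = glue_context.spark_session

    # Steps 1 and 2: load the source table and convert it to a Spark DataFrame.
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name
    )
    df = dyf.toDF()

    # Step 3: expose the DataFrame to Spark SQL under a temporary view name.
    df.createOrReplaceTempView("source_view")

    # Step 4: run the custom SQL against the temporary view.
    result = spark.sql(build_select_sql("source_view", extra_columns))

    # Step 5: write to S3, one directory per partition value.
    # Parquet and "append" are assumptions; match your target table's format.
    result.write.mode("append").format("parquet").partitionBy(*partition_cols).save(output_path)
    return result
```

Inside a Glue script this could be called as run_job(glueContext, "my_db", "source_table", "s3://my-bucket/target/", ["year", "month"], {"load_date": "current_date()"}), where every argument value is a made-up example.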

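Also, since your data is partitioned and you have already created the output Athena table, you need to run the MSCK REPAIR TABLE command in Athena to load the partitions into the table metadata.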
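If you would rather trigger the repair from code than from the Athena console, one option is to submit the statement through the boto3 Athena client. This is only a sketch: the function name repair_partitions and all argument values are hypothetical, and the caller is assumed to have permission to run Athena queries and to write query results to the given S3 location.

```python
def repair_partitions(athena_client, database, table, output_location):
    """Submit MSCK REPAIR TABLE for the given table and return the query id.

    athena_client is expected to be a boto3 Athena client, e.g.
    boto3.client("athena"). The query runs asynchronously; this function
    does not wait for it to finish.
    """
    response = athena_client.start_query_execution(
        QueryString=f"MSCK REPAIR TABLE {table}",
        QueryExecutionContext={"Database": database},
        # Athena requires an S3 location for query results (assumed here).
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

A call might look like repair_partitions(boto3.client("athena"), "my_db", "target_table", "s3://my-bucket/athena-results/"). Passing the client in as a parameter keeps the function easy to exercise with a stub.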

aws-glue