AWS Glue ETL Doesn't Output All Records

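I have an ETL script that is meant to flatten a set of 4 million JSON files using Relationalize. The script works fine on a test set of 300 files, but when run against an S3 bucket with 4 million files it produces only about 1,500 output files, each containing the data for a single record.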

I have tried several different configurations of the script, but they all produce the same result:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Begin variables to customize with your information
glue_source_database = "mydatabase"
glue_source_table = "mytable"
glue_temp_storage = "s3://my-data/glue_temp"
glue_relationalize_output_s3_path = "s3://my-data/glue_output/mytable_flat/"
dfc_root_table_name = "root" #default value is "roottable"
# End variables to customize with your information


datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")
dfc = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")
origdata = dfc.select(dfc_root_table_name)

origdataoutput = glueContext.write_dynamic_frame.from_options(frame = origdata, connection_type = "s3", connection_options = {"path": glue_relationalize_output_s3_path}, format = "json", transformation_ctx = "origdataoutput")

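It looks like you are only writing the root table: the script selects dfc_root_table_name from the collection and passes just that frame to glueContext.write_dynamic_frame.from_options. When you run Relationalize, it returns a DynamicFrameCollection, with nested arrays split out into separate DynamicFrames. To see the list of DynamicFrames in this collection, try printing them with dfc.keys().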

Refer to this and to step 6 in this to understand how Relationalize works.
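As a rough illustration of the suggestion above, the sketch below loops over every key in the collection and writes each DynamicFrame to its own prefix instead of writing only the root. The helper name write_all_frames and the one-prefix-per-table layout are assumptions made for this example, not part of the Glue API; it relies only on DynamicFrameCollection.keys(), DynamicFrameCollection.select() and glueContext.write_dynamic_frame.from_options, which the script already uses.

```python
def write_all_frames(glue_context, dfc, base_path, fmt="json"):
    """Write every DynamicFrame in a DynamicFrameCollection under base_path.

    Hypothetical helper: each frame goes to <base_path>/<frame name>/.
    Returns a dict mapping frame name to the path it was written to.
    """
    written = {}
    for name in dfc.keys():
        # Show which tables Relationalize actually produced.
        print("writing frame:", name)
        frame = dfc.select(name)
        # Assumed layout: one S3 prefix per table in the collection.
        path = base_path.rstrip("/") + "/" + name + "/"
        glue_context.write_dynamic_frame.from_options(
            frame=frame,
            connection_type="s3",
            connection_options={"path": path},
            format=fmt,
        )
        written[name] = path
    return written
```

In the script from the question, this would be called as write_all_frames(glueContext, dfc, glue_relationalize_output_s3_path) in place of the single write of origdata. Whether that recovers the missing records depends on how the source JSON is nested, so checking the output of dfc.keys() first is still the quickest diagnostic.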