AWS Glue simple custom transformation with desired output column names

Suppose I have s3://mybucket/mydata/ containing CSV files with the following columns:

color, shape, quantity, cost

The types are:

string, string, double, double

As a contrived example, suppose I want to transform the data and dump it to s3://mybucket/mydata-transformed/ by converting the strings to uppercase and adding 2 to the doubles. So a row red,circle,2,21.7 would become RED,CIRCLE,4,23.7 in the output. The code below does roughly what I want (boilerplate omitted), where a table "mydata" has already been created for the source bucket:

DataSource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "my database", table_name = "mydata", transformation_ctx = "DataSource0")
ds_df = DataSource0.toDF()
ds_df.select("color","shape","quantity","cost").show()
ds_df1 = ds_df.select(upper(col('color')), upper(col('shape')), col('quantity')+2, col('cost')+2)
Transform0 = DynamicFrame.fromDF(ds_df1, glueContext, "Transform0")
DataSink0 = glueContext.write_dynamic_frame.from_options(
    frame = Transform0, connection_type = "s3", format = "json",
    connection_options = {"path": "s3://mybucket/mydata-transformed/", "partitionKeys": []},
    transformation_ctx = "DataSink0")
job.commit()

Here is the resulting JSON for the sample data above:

{"upper(color)":"RED","upper(shape)":"CIRCLE","(quantity + 2)":4.0,"(cost + 2)":23.7}

The data is transformed correctly. However, the column names are now "upper(color)", "upper(shape)", "(quantity + 2)", and "(cost + 2)". How can I make the resulting column names be color, shape, quantity, cost?

To solve the problem, you can use alias(). See the complete example below:

from pyspark.sql.functions import col, upper

jsonStr = """{  "color": "red", "shape": "square","quantity":4,"cost":"11.11" }"""
df = spark.read.json(sc.parallelize([jsonStr]))
df.show()

+-----+-----+--------+------+
|color| cost|quantity| shape|
+-----+-----+--------+------+
|  red|11.11|       4|square|
+-----+-----+--------+------+


ds_df1 = df.select(
    upper(col('color')).alias('color'),
    upper(col('shape')).alias('shape'),
    'quantity',
    'cost')

ds_df1.show()

+-----+------+--------+-----+
|color| shape|quantity| cost|
+-----+------+--------+-----+
|  RED|SQUARE|       4|11.11|
+-----+------+--------+-----+