AWS Glue performance when writing

Question

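After performing joins and an aggregation, I want the output in 1 file per partition, partitioned by some column. When I use repartition(1), the job takes 1 hour. If I remove repartition(1), each partition contains multiple files, but the job takes only 30 minutes (see the example below). So is there a way to write the data to 1 file?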

...
...
df = df.repartition(1)  # collapse to one partition so each choice=... prefix gets a single file
glueContext.write_dynamic_frame.from_options(
    frame = df,
    connection_type = "s3",
    connection_options = {
        "path": "s3://s3path",
        "partitionKeys": ["choice"]
        },
    format = "csv",
    transformation_ctx = "datasink2")

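Is there any other way to improve write performance? Does changing the format help? And how can I keep the write parallel while still producing 1 file per partition?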

S3 storage example

**if repartition(1)** // what I want but takes more time
choice=0/part-00-001
..
..
choice=500/part-00-001

**if removed** // takes less time but multiple files are present
choice=0/part-00-001
....
choice=0/part-00-0032
..
..
choice=500/part-00-001
....
choice=500/part-00-0032

Answer 1

Instead of using df.repartition(1), use df.repartition("choice"):

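# Repartition by the partition column so all rows for one choice value land in the same task.
# Column-based repartition is a Spark DataFrame method; if df is a Glue DynamicFrame,
# convert it with df.toDF() first and wrap the result with DynamicFrame.fromDF() before writing.
df = df.repartition("choice")
glueContext.write_dynamic_frame.from_options(
    frame = df,
    connection_type = "s3",
    connection_options = {
        "path": "s3://s3path",
        "partitionKeys": ["choice"]
        },
    format = "csv",
    transformation_ctx = "datasink2")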
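
To see why repartitioning by the partition column gives one file per choice value without funnelling everything through a single task, here is a minimal plain-Python model. It is not Spark or Glue code. It assumes that each write task emits one file for every distinct key value it holds. `files_per_key`, `NUM_TASKS`, and the sample rows are hypothetical names and data made up for illustration, and `key % NUM_TASKS` is only a stand-in for Spark's hash partitioning.

```python
from collections import defaultdict


def files_per_key(rows, assign_task):
    """Count output files per key value in a simplified partitioned write.

    Model assumption: every task writes one file for each distinct key
    value among the rows assigned to it.
    """
    keys_by_task = defaultdict(set)
    for i, key in enumerate(rows):
        keys_by_task[assign_task(i, key)].add(key)

    counts = defaultdict(int)
    for keys in keys_by_task.values():
        for key in keys:
            counts[key] += 1
    return dict(counts)


NUM_TASKS = 4  # hypothetical number of write tasks

# Hypothetical data: three "choice" values, eight rows each, interleaved.
rows = [key for _ in range(8) for key in (0, 1, 2)]

# No repartition: rows for one key are scattered over all tasks.
no_repartition = files_per_key(rows, lambda i, key: i % NUM_TASKS)

# repartition(1): a single task holds every row.
single_task = files_per_key(rows, lambda i, key: 0)

# repartition("choice"): rows are routed to tasks by key.
by_column = files_per_key(rows, lambda i, key: key % NUM_TASKS)
tasks_used_by_column = len({key % NUM_TASKS for key in rows})

print(no_repartition)        # files per key with scattered rows
print(single_task)           # files per key with one task
print(by_column, tasks_used_by_column)
```

In this model both repartition(1) and repartition by key yield one file per key, but only the key-based variant spreads the write over several tasks. In real Spark several keys can hash to the same task, and a heavily skewed key still makes its task slow, so the speedup depends on how evenly the data is distributed across choice values.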

Answer 2

If the goal is to have a single file, use coalesce instead of repartition; it avoids a data shuffle.
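
The difference between the two can be sketched with a small plain-Python model. This is a simplification, not Spark's implementation: here `coalesce` merges whole parent partitions in contiguous groups, while `repartition` deals every row round-robin across the outputs. The function names and sample data are hypothetical and only mirror the Spark method names.

```python
def coalesce(partitions, n):
    """Merge whole parent partitions into n outputs; rows are never split up."""
    out = [[] for _ in range(n)]
    for idx, part in enumerate(partitions):
        out[idx * n // len(partitions)].extend(part)
    return out


def repartition(partitions, n):
    """Deal every row round-robin across n outputs, like a full shuffle."""
    out = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for row in part:
            out[i % n].append(row)
            i += 1
    return out


def outputs_per_parent(partitions, result):
    """For each parent partition, count the outputs that received its rows."""
    return [
        sum(1 for out in result if set(part) & set(out))
        for part in partitions
    ]


# Hypothetical input: 4 parent partitions with 3 uniquely tagged rows each.
parents = [[(p, r) for r in range(3)] for p in range(4)]

coalesced = coalesce(parents, 2)
shuffled = repartition(parents, 2)

print(outputs_per_parent(parents, coalesced))  # each parent feeds one output
print(outputs_per_parent(parents, shuffled))   # each parent feeds several outputs
```

In the model, coalescing keeps each parent partition intact, whereas repartitioning spreads every parent's rows over multiple outputs, which is the data movement a shuffle pays for. One caveat worth checking on a real job: because coalesce(1) adds no shuffle boundary, Spark may also run the upstream computation in fewer tasks, so it is not guaranteed to be faster than repartition(1).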
