How to configure Spark / Glue to avoid creation of empty $_folder_$ after Glue job successful execution

Question

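I have a simple Glue ETL job that is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back to an S3 bucket. The job completes successfully. However, the empty "$folder$" folders that Spark generates remain in S3. They do not look good in the hierarchy and cause confusion. Is there any way to configure the Spark or Glue context to hide or remove these folders after the job completes successfully?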

-------------------- S3 image --------------------

Answer 1

OK, after a few days of testing I finally found the solution. Before pasting the code, let me summarize what I found...

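These $folder$ objects are created by Hadoop. Apache Hadoop creates these files when it creates a folder in an S3 bucket. Source 1 They are in fact directory markers, as path + /. Source 2
To change the behavior, you need to change the Hadoop S3 write configuration in the Spark context. Read this and this and
learn about S3, S3a and S3n here and here
Thanks to @stevel's comment here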

The solution is to set the following configuration on the Spark context's Hadoop configuration.

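from pyspark.context import SparkContext

sc = SparkContext()
hadoop_conf = sc._jsc.hadoopConfiguration()
# Route s3:// URIs through the S3A file system implementation
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")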

To avoid creating the _SUCCESS file, you also need to set the following configuration: hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
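
The two settings above can be kept in one place and applied together. This is a minimal sketch rather than part of the tested solution: `HADOOP_SETTINGS` and `apply_hadoop_settings` are hypothetical names, and the helper only assumes that the object it receives exposes `set(key, value)`, as the Hadoop configuration returned by `sc._jsc.hadoopConfiguration()` does.

```python
# Hypothetical helper: keeps the Hadoop overrides from this answer in one dict.
HADOOP_SETTINGS = {
    "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
}


def apply_hadoop_settings(hadoop_conf, settings=None):
    """Call hadoop_conf.set(key, value) for each entry; return the applied dict."""
    applied = dict(HADOOP_SETTINGS if settings is None else settings)
    for key, value in applied.items():
        hadoop_conf.set(key, value)
    return applied
```

In a job script this would be called once, before any write, for example `apply_hadoop_settings(sc._jsc.hadoopConfiguration())`.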

Make sure you use an s3:// URI when writing to the S3 bucket. For example:

myDF.write.mode("overwrite").parquet('s3://XXX/YY', partitionBy=['DDD'])
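
The configuration changes how new output is written; markers left behind by earlier runs may still be in the bucket. Below is a minimal cleanup sketch that is not part of the tested solution, under these assumptions: marker keys end in `_$folder$`, `s3_client` is a boto3 S3 client (for example `boto3.client("s3")`), and the function name is hypothetical. It relies on the `list_objects_v2` paginator and on `delete_objects`, which accepts at most 1000 keys per request.

```python
MARKER_SUFFIX = "_$folder$"  # assumed suffix of the Hadoop directory markers


def delete_folder_markers(s3_client, bucket, prefix=""):
    """Delete every object under `prefix` whose key ends with MARKER_SUFFIX.

    Returns the list of keys that were submitted for deletion.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    markers = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty listing have no "Contents" entry
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(MARKER_SUFFIX):
                markers.append(obj["Key"])

    # delete_objects accepts at most 1000 keys per request
    for start in range(0, len(markers), 1000):
        batch = markers[start:start + 1000]
        s3_client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in batch]},
        )
    return markers
```

With the placeholder path from the example above, the call would look like `delete_folder_markers(boto3.client("s3"), "XXX", "YY")`. The job role would need permission to list the bucket and delete objects in it.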

amazon-web-services

aws-glue

aws-glue-workflow

aws-glue-spark