AWS Glue ETL job failing with "Failed to delete key: parquet-output/_temporary"
I am running a Glue ETL job against the table of CSV data generated by a Glue Crawler. The crawler crawled a directory with the following structure:
s3
-> aggregated output
   -> datafile1.csv
   -> datafile2.csv
   -> datafile3.csv
These files are aggregated into a single "aggregated-output" table, which can be queried successfully in Athena.
I am trying to convert it to Parquet files using an AWS Glue ETL job. The job fails with
"py4j.protocol.Py4JJavaError: An error occurred while calling
o92.pyWriteDynamicFrame.
: java.io.IOException: Failed to delete key: parquet-output/_temporary"
I am unable to find the root cause here.
I have tried modifying the Glue job in several ways. I made sure the IAM role assigned to the job has permission to delete folders in the relevant buckets. I am currently using the default temp/script folders provided by AWS; I also tried folders in my own S3 bucket, but saw similar errors:
s3://aws-glue-temporary-256967298135-us-east-2/admin
s3://aws-glue-scripts-256967298135-us-east-2/admin/rt-5/13-ETL-Tosat
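For reference, the relevant part of the job script, reconstructed from the traceback below, looks roughly like this (the Data Catalog database and table names are placeholders):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import DropNullFields
from awsglue.utils import getResolvedOptions

# Standard Glue job boilerplate
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the crawled CSV table from the Data Catalog
# ("csv_db" and "aggregated_output" are placeholder names)
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="csv_db",
    table_name="aggregated_output",
    transformation_ctx="datasource0")

# Drop null fields, as in the generated script (see "null_fields []" in the log)
dropnullfields3 = DropNullFields.apply(
    frame=datasource0,
    transformation_ctx="dropnullfields3")

# Write out as Parquet -- this is the call that fails during cleanup
datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=dropnullfields3,
    connection_type="s3",
    connection_options={"path": "s3://changed-s3-path-parquet-output/parquet-output"},
    format="parquet",
    transformation_ctx="datasink4")

job.commit()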
Sharing the full stack trace below:
19/05/10 17:57:59 INFO client.RMProxy: Connecting to ResourceManager at ip-172-32-29-36.us-east-2.compute.internal/172.32.29.36:8032
Container: container_1557510304861_0001_01_000002 on ip-172-32-1-101.us-east-2.compute.internal_8041
LogType:stdout
Log Upload Time:Fri May 10 17:57:53 +0000 2019
LogLength:0
Log Contents:
End of LogType:stdout
Container: container_1557510304861_0001_01_000001 on ip-172-32-26-232.us-east-2.compute.internal_8041
LogType:stdout
Log Upload Time:Fri May 10 17:57:54 +0000 2019
LogLength:9048
Log Contents:
null_fields []
Traceback (most recent call last):
File "script_2019-05-10-17-47-48.py", line 40, in <module>
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options =
{
"path": "s3://changed-s3-path-parquet-output/parquet-output"
}
, format = "parquet", transformation_ctx = "datasink4")
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/PyGlue.zip/awsglue/dynamicframe.py", line 585, in from_options
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/PyGlue.zip/awsglue/context.py", line 193, in write_dynamic_frame_from_options
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/PyGlue.zip/awsglue/context.py", line 216, in write_from_options
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/PyGlue.zip/awsglue/data_sink.py", line 32, in write
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/PyGlue.zip/awsglue/data_sink.py", line 28, in writeFrame
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o92.pyWriteDynamicFrame.
: java.io.IOException: Failed to delete key: parquet-output/_temporary
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:689)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:296)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.cleanupJob(FileOutputCommitter.java:463)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.abortJob(FileOutputCommitter.java:482)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:156)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write.apply$mcV$sp(FileFormatWriter.scala:212)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at com.amazonaws.services.glue.SparkSQLDataSink.writeDynamicFrame(DataSink.scala:123)
at com.amazonaws.services.glue.DataSink.pyWriteDynamicFrame(DataSink.scala:38)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: 1 exceptions thrown from 1 batch deletes
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:375)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy56.deleteAll(Unknown Source)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1369)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:687)
... 50 more
Caused by: java.io.IOException: MultiObjectDeleteException thrown with 26 keys in error: parquet-output/_temporary/0/_temporary/attempt_20190510164114_0000_m_000001_1/part-00001-7288591a-dd6f-404a-9c24-1dd70b3ff4ff-c000.snappy.parquet, parquet-output/_temporary/0/_temporary/attempt_20190510164315_0000_m_000001_2/part-00001-7288591a-dd6f-404a-9c24-1dd70b3ff4ff-c000.snappy.parquet, parquet-output/_temporary/0/_temporary_$folder$
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:360)
... 59 more
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.MultiObjectDeleteException: One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null; Request ID: 071747B6992918B9), S3 Extended Request ID: oTBHx76MMI70zuD86AWeq4+rXa3GalW6ptQFM91ceQwhB9SBJV9Z6qw1yG2Ar2DBOr06soLL5dE=
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:2084)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.DeleteObjectsCall.perform(DeleteObjectsCall.java:25)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.DeleteObjectsCall.perform(DeleteObjectsCall.java:11)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:80)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:176)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.deleteObjects(AmazonS3LiteClient.java:125)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:355)
... 59 more
End of LogType:stdout
Resolved this by creating a new IAM role with S3 and AWSGlueServiceRole permissions.
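As a rough sketch of that setup with boto3 (the role name is arbitrary, and AmazonS3FullAccess is used here for brevity; a policy scoped to the target bucket is preferable):

import json
import boto3

iam = boto3.client("iam")

# Trust policy letting the Glue service assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

# "GlueETLRole" is a placeholder name
iam.create_role(
    RoleName="GlueETLRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy))

# Attach the Glue service policy plus S3 access
iam.attach_role_policy(
    RoleName="GlueETLRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole")
iam.attach_role_policy(
    RoleName="GlueETLRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess")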
Ran into the same issue today. It turned out that the IAM role policy used for the Glue job needs not only read/write permissions on the relevant bucket/path, but also s3:DeleteObject, so the job can clean up the _temporary directory. The error message essentially says it is not failing on the write, but on the delete step.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::target-bucket/*"
            ]
        }
    ]
}
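If the role is managed with boto3, an inline policy like the one above can be attached with put_role_policy (the role and policy names below are placeholders):

import json
import boto3

iam = boto3.client("iam")

# The inline policy shown above, granting read/write/delete on the target bucket
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "VisualEditor0",
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
        "Resource": ["arn:aws:s3:::target-bucket/*"]
    }]
}

# Attach it as an inline policy on the Glue job's role
# (role and policy names are placeholders)
iam.put_role_policy(
    RoleName="GlueETLRole",
    PolicyName="GlueS3TargetAccess",
    PolicyDocument=json.dumps(policy))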