Spark 1.6.1 S3 MultiObjectDeleteException
I'm writing data to S3 from Spark using S3A URIs.
I'm also leveraging the s3-external-1.amazonaws.com endpoint to avoid the read-after-write eventual-consistency problem on us-east-1.
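Roughly how that is wired up in my job (a minimal sketch; sc is the SparkContext and the bucket/path are placeholders, not my actual names):

// Point the s3a connector at the read-after-write-consistent endpoint
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-external-1.amazonaws.com")

// Writes then go through an s3a:// URI, for example:
df.write.orc("s3a://mybucketname/some/output/path")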
When trying to write some data to S3 (it's actually a move operation), I get the following:
com.amazonaws.services.s3.model.MultiObjectDeleteException: Status Code: 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS Error Message: One or more objects could not be deleted, S3 Extended Request ID: null
at com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:1745)
at org.apache.hadoop.fs.s3a.S3AFileSystem.delete(S3AFileSystem.java:687)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.cleanupJob(FileOutputCommitter.java:381)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:314)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at org.apache.spark.sql.DataFrameWriter.orc(DataFrameWriter.scala:346)
at com.mgmg.memengine.stream.app.persistentEventStreamBootstrap$$anonfun$setupSsc.apply(persistentEventStreamBootstrap.scala:122)
at com.mgmg.memengine.stream.app.persistentEventStreamBootstrap$$anonfun$setupSsc.apply(persistentEventStreamBootstrap.scala:112)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$$anonfun$apply$mcV$sp.apply(DStream.scala:661)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$$anonfun$apply$mcV$sp.apply(DStream.scala:661)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$$anonfun$apply$mcV$sp.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$$anonfun$apply$mcV$sp.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$$anonfun$apply$mcV$sp.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
Answer:
I ran into this problem when upgrading to Spark 2.0.0, and it turned out to be missing S3 permissions. I currently run Spark 2.0.0 with aws-java-sdk-1.7.4 and hadoop-aws-2.7.2 as dependencies.
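For context, the relevant part of my build looks roughly like this (an sbt sketch; the coordinates are the standard Maven ones, but your build tool and versions may differ):

// build.sbt (excerpt) - aws-java-sdk 1.7.4 is the SDK version hadoop-aws 2.7.x was built against
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-aws"   % "2.7.2",
  "com.amazonaws"     % "aws-java-sdk" % "1.7.4"
)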
To resolve it, I had to add the s3:Delete* action to the appropriate IAM policy. Depending on how your environment is set up, that could be a policy on the S3 bucket, a policy on the IAM user whose SECRET_KEY your Hadoop s3a library connects with, or the IAM role policy of the EC2 instance Spark is running on.
In my case, my working IAM role policy now looks like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Delete*", "s3:Get*", "s3:List*", "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::mybucketname/*"
    }
  ]
}
This is a quick change through the S3 or IAM AWS console and applies immediately, with no Spark cluster restart required. If you're unsure how to edit policies, I've provided more detail here.
This can also be caused by a race condition where more than one process is trying to delete the path, as HADOOP-14101 implies.
In that specific case, you should be able to make the stack trace go away by setting the Hadoop option fs.s3a.multiobjectdelete.enable to false, as in the sketch below.
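A minimal sketch of turning that option off through the SparkContext's Hadoop configuration (setting spark.hadoop.fs.s3a.multiobjectdelete.enable=false in spark-defaults.conf should have the same effect):

// Fall back to single-object DELETE calls instead of bulk multi-object deletes
sc.hadoopConfiguration.set("fs.s3a.multiobjectdelete.enable", "false")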
Update, 2017-02-23
Having written some tests for this, I haven't been able to reproduce it for deleting non-existent paths, but I have for permission problems. Assume that's the cause for now, though we'd welcome more stack traces to help pin the problem down. HADOOP-11572 covers the issue, including patches, documentation, and better logging of the failure (i.e. logging the paths that failed and the specific errors).