在 Datastax Spark 提交中使用 Scala 将文件从 S3 存储桶读取到 Spark Dataframe,给出 AWS 错误消息:错误请求
Read Files from S3 bucket to Spark Dataframe using Scala in Datastax Spark Submit giving AWS Error Message: Bad Request
我正在尝试读取位于孟买的 s3 存储桶上的 CSV 文件Region.I正在尝试使用 datastax dse spark-submit 读取文件。
我尝试将 hadoop-aws 版本更改为其他各种版本。目前hadoop-aws版本为2.7.3
spark.sparkContext.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", accessKeyId)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secretAccessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val df = spark.read.csv("s3a://bucket_path/csv_name.csv")
执行后,出现以下错误,
Exception in thread "main"
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400,
AWS Service: Amazon S3, AWS Request ID: 8C7D34A38E359FCE, AWS Error
Code: null, AWS Error Message: Bad Request at
com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at
com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at
com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at
com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at
com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access0(FileSystem.java:92) at
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371) at
org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) at
org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:616)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun.apply(DataSource.scala:350)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun.apply(DataSource.scala:350)
at
scala.collection.TraversableLike$$anonfun$flatMap.apply(TraversableLike.scala:241)
at
scala.collection.TraversableLike$$anonfun$flatMap.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392) at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355) at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349)
at
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
at
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
您的签名 V4 选项未应用。参见
在 运行 spark-submit 或 spark-shell 时添加 java 选项。
spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
或者,设置系统属性如:
System.setProperty("com.amazonaws.services.s3.enableV4", "true");
感谢大家的帮助。我从 Lamanus 的回答中发现即使在
中添加了签名 V4 选项也没有应用
spark.sparkContext.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
所以我添加了以下行,现在代码可以正常工作了。
import com.amazonaws.SDKGlobalConfiguration
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")
我正在尝试读取位于孟买的 s3 存储桶上的 CSV 文件Region.I正在尝试使用 datastax dse spark-submit 读取文件。
我尝试将 hadoop-aws 版本更改为其他各种版本。目前hadoop-aws版本为2.7.3
spark.sparkContext.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", accessKeyId)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secretAccessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val df = spark.read.csv("s3a://bucket_path/csv_name.csv")
执行后,出现以下错误,
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 8C7D34A38E359FCE, AWS Error Code: null, AWS Error Message: Bad Request at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798) at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232) at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528) at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031) at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994) at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653) at org.apache.hadoop.fs.FileSystem.access0(FileSystem.java:92) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:616) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun.apply(DataSource.scala:350) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun.apply(DataSource.scala:350) at scala.collection.TraversableLike$$anonfun$flatMap.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap.apply(TraversableLike.scala:241) at scala.collection.immutable.List.foreach(List.scala:392) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.immutable.List.flatMap(List.scala:355) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
您的签名 V4 选项未应用。参见
在 运行 spark-submit 或 spark-shell 时添加 java 选项。
spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
或者,设置系统属性如:
System.setProperty("com.amazonaws.services.s3.enableV4", "true");
感谢大家的帮助。我从 Lamanus 的回答中发现即使在
中添加了签名 V4 选项也没有应用spark.sparkContext.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
所以我添加了以下行,现在代码可以正常工作了。
import com.amazonaws.SDKGlobalConfiguration
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")