spark - redshift - s3: class path conflict
I am trying to connect to Redshift from a Spark 2.1.0 standalone cluster on AWS, using Hadoop 2.7.2 and Alluxio, and I get this error: Exception in thread "main" java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)
As far as I understand, the problem is what this note describes:
Note on Amazon SDK dependency: This library declares a provided dependency on components of the AWS Java SDK. In most cases, these libraries will be provided by your deployment environment. However, if you get ClassNotFoundExceptions for Amazon SDK classes then you will need to add explicit dependencies on com.amazonaws.aws-java-sdk-core and com.amazonaws.aws-java-sdk-s3 as part of your build / runtime configuration. See the comments in project/SparkRedshiftBuild.scala for more details.
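A quick way to see which jar the conflicting class actually comes from at runtime is to print its code source; the following is a minimal sketch (nothing spark- or redshift-specific, just the standard Class / ProtectionDomain API), to be run on the driver before anything touches s3a:
// Diagnostic sketch: print the jar each class is loaded from,
// to see which SDK version actually wins on the classpath.
object ClasspathCheck {
  def main(args: Array[String]): Unit = {
    Seq(
      "com.amazonaws.services.s3.transfer.TransferManager",
      "org.apache.hadoop.fs.s3a.S3AFileSystem"
    ).foreach { name =>
      val location = Class.forName(name).getProtectionDomain.getCodeSource.getLocation
      println(s"$name -> $location")
    }
  }
}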
As suggested in the spark-redshift (Databricks) documentation, I have tried every class-path jar combination I could think of, always with the same error. The spark-submit where I pass all the jars is below:
/usr/local/spark/bin/spark-submit \
  --class com.XX.XX.app.Test \
  --driver-memory 2G \
  --total-executor-cores 40 \
  --verbose \
  --jars /home/ubuntu/aws-java-sdk-s3-1.11.79.jar,/home/ubuntu/aws-java-sdk-core-1.11.79.jar,/home/ubuntu/postgresql-9.4.1207.jar,/home/ubuntu/alluxio-1.3.0-spark-client-jar-with-dependencies.jar,/usr/local/alluxio/core/client/target/alluxio-core-client-1.3.0-jar-with-dependencies.jar \
  --master spark://XXX.eu-west-1.compute.internal:7077 \
  --executor-memory 4G \
  /home/ubuntu/QAe.jar qa XXX.eu-west-1.compute.amazonaws.com 100 \
  --num-executors 10 \
  --conf spark.executor.extraClassPath=/home/ubuntu/aws-java-sdk-s3-1.11.79.jar:/home/ubuntu/aws-java-sdk-core-1.11.79.jar \
  --driver-class-path /home/ubuntu/aws-java-sdk-s3-1.11.79.jar:/home/ubuntu/aws-java-sdk-core-1.11.79.jar:/home/ubuntu/postgresql-9.4.1207.jar \
  --driver-library-path /home/ubuntu/aws-java-sdk-s3-1.11.79.jar:/home/ubuntu/aws-java-sdk-core-1.11.79.jar \
  --driver-library-path com.amazonaws.aws-java-sdk-s3:com.amazonaws.aws-java-sdk-core.jar \
  --packages databricks:spark-redshift_2.11:3.0.0-preview1,com.amazonaws:aws-java-sdk-s3:1.11.79,com.amazonaws:aws-java-sdk-core:1.11.79
My build.sbt:
libraryDependencies += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.4"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-core" % "1.11.79"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3" % "1.11.79"
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.8.1"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-redshift" % "1.11.78"
libraryDependencies += "com.databricks" % "spark-redshift_2.11" % "3.0.0-preview1"
libraryDependencies += "org.alluxio" % "alluxio-core-client" % "1.3.0"
libraryDependencies += "com.taxis99" %% "awsscala" % "0.7.3"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-mllib" % sparkVersion
The code simply reads from PostgreSQL and writes to Redshift:
val df = spark.read.jdbc(url_read,"public.test", prop).as[Schema.Message.Raw]
.filter("message != ''")
.filter("from_id >= 0")
.limit(100)
df.write
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://test.XXX.redshift.amazonaws.com:5439/test?user=test&password=testXXXXX")
.option("dbtable", "table_test")
.option("tempdir", "s3a://redshift_logs/")
.option("forward_spark_s3_credentials", "true")
.option("tempformat", "CSV")
.option("jdbcdriver", "com.amazon.redshift.jdbc42.Driver")
.mode(SaveMode.Overwrite)
.save()
All of the jars listed above are also present under /home/ubuntu/ on every cluster node.
Does anyone know how to add explicit dependencies on com.amazonaws.aws-java-sdk-core and com.amazonaws.aws-java-sdk-s3 as part of the build / runtime configuration in Spark? Or is the problem with the jars themselves, i.e. am I using the wrong version (1.11.80 vs 1.11.79, etc.)?
Do I need to exclude these libraries from build.sbt? (A sketch of what that exclusion would look like follows these questions.)
Would moving to Hadoop 2.8 solve the problem?
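If exclusion does turn out to be necessary, the sbt syntax would look roughly like this (illustrative only, shown on the spark-redshift coordinate from the build.sbt above; the same .exclude calls can be attached to any dependency that drags in its own copy of the SDK):
// Illustrative only: drop the AWS SDK artifacts a dependency would otherwise pull in,
// so the version declared explicitly above is the only one on the classpath.
libraryDependencies += ("com.databricks" % "spark-redshift_2.11" % "3.0.0-preview1")
  .exclude("com.amazonaws", "aws-java-sdk-core")
  .exclude("com.amazonaws", "aws-java-sdk-s3")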
These are the links I used while testing:
Dependency Management with Spark
Amazon tends to change the APIs of its libraries fast enough that every version of hadoop-aws.jar needs to be kept in sync with a specific AWS SDK release; for Hadoop 2.7.x that is SDK v1.7.4. As things stand, you are probably not going to get redshift and s3a to coexist, although you may be able to keep using the older s3n URLs.
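A minimal sketch of that s3n fallback, assuming the write stays exactly as in the question (df, credentials and cluster endpoint unchanged) and only the tempdir scheme differs:
import org.apache.spark.sql.SaveMode

// df is the DataFrame built in the question; the only change is s3n:// instead of s3a:// for the temp dir.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://test.XXX.redshift.amazonaws.com:5439/test?user=test&password=testXXXXX")
  .option("dbtable", "table_test")
  .option("tempdir", "s3n://redshift_logs/")
  .option("forward_spark_s3_credentials", "true")
  .mode(SaveMode.Overwrite)
  .save()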
The move to a newer SDK only arrives in Hadoop 2.8+, where Hadoop upgrades to 1.11.45. Why so late? Because it forces an update of Jackson, which ends up breaking everything else downstream.
Welcome to the world of transitive-dependency JAR hell; let's all hope Java 9 sorts this out, although it will need someone (you?) to add all the relevant module declarations.
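If the plan is to stay on Hadoop 2.7.x, the corresponding build.sbt change is a sketch along these lines (assuming hadoop-aws 2.7.3, which was built against the monolithic 1.7.4 SDK, replaces the 1.11.x artifacts):
// Sketch: align the AWS SDK with what hadoop-aws 2.7.x was compiled against,
// instead of mixing in the newer split artifacts (aws-java-sdk-core / -s3 1.11.x).
libraryDependencies += "org.apache.hadoop" % "hadoop-aws"   % "2.7.3"
libraryDependencies += "com.amazonaws"     % "aws-java-sdk" % "1.7.4"  // monolithic SDK, matches Hadoop 2.7.x
As the first paragraph says, this only sorts out the s3a side; whether spark-redshift tolerates an SDK that old is exactly the coexistence problem described above.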