Why does Spark application fail with "ClassNotFoundException: Failed to find data source: jdbc" as uber-jar with sbt assembly?
I'm trying to assemble a Spark application using sbt 1.0.4 with sbt-assembly 0.14.6.
The Spark application works fine when launched from IntelliJ IDEA or with spark-submit, but when I run the assembled uber-jar from the command line (cmd in Windows 10):
java -Xmx1024m -jar my-app.jar
I get the following exception:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: jdbc. Please find packages at http://spark.apache.org/third-party-projects.html
The Spark application looks as follows.
package spark.main

import java.util.Properties

import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]) {
    val connectionProperties = new Properties()
    connectionProperties.put("user", "postgres")
    connectionProperties.put("password", "postgres")
    connectionProperties.put("driver", "org.postgresql.Driver")
    val testTable = "test_tbl"

    val spark = SparkSession.builder()
      .appName("Postgres Test")
      .master("local[*]")
      .config("spark.hadoop.fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
      .config("spark.sql.warehouse.dir", System.getProperty("java.io.tmpdir") + "swd")
      .getOrCreate()

    val dfPg = spark.sqlContext.read.
      jdbc("jdbc:postgresql://localhost/testdb", testTable, connectionProperties)

    dfPg.show()
  }
}
Below is the build.sbt.
name := "apache-spark-scala"
version := "0.1-SNAPSHOT"
scalaVersion := "2.11.8"
mainClass in Compile := Some("spark.main.Main")
libraryDependencies ++= {
val sparkVer = "2.1.1"
val postgreVer = "42.0.0"
val cassandraConVer = "2.0.2"
val configVer = "1.3.1"
val logbackVer = "1.7.25"
val loggingVer = "3.7.2"
val commonsCodecVer = "1.10"
Seq(
"org.apache.spark" %% "spark-sql" % sparkVer,
"org.apache.spark" %% "spark-core" % sparkVer,
"com.datastax.spark" %% "spark-cassandra-connector" % cassandraConVer,
"org.postgresql" % "postgresql" % postgreVer,
"com.typesafe" % "config" % configVer,
"commons-codec" % "commons-codec" % commonsCodecVer,
"com.typesafe.scala-logging" %% "scala-logging" % loggingVer,
"org.slf4j" % "slf4j-api" % logbackVer
)
}
dependencyOverrides ++= Seq(
"io.netty" % "netty-all" % "4.0.42.Final",
"commons-net" % "commons-net" % "2.2",
"com.google.guava" % "guava" % "14.0.1"
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
Does anyone have any idea why?
[UPDATE]
The following configuration, taken from the official GitHub repository, did the trick:
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) =>
    xs map {_.toLowerCase} match {
      case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) =>
        MergeStrategy.discard
      case ps @ (x :: xs) if ps.last.endsWith(".sf") || ps.last.endsWith(".dsa") =>
        MergeStrategy.discard
      case "services" :: _ => MergeStrategy.filterDistinctLines
      case _ => MergeStrategy.first
    }
  case _ => MergeStrategy.first
}
This question is almost a duplicate of another one, with the difference that the other OP used Apache Maven to create the uber-jar, while here it's about sbt (the sbt-assembly plugin's configuration, to be precise).
The short name (a.k.a. alias) of a data source, e.g. jdbc or kafka, is only available if the corresponding META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file registers a DataSourceRegister.
For the jdbc alias to work, Spark SQL uses META-INF/services/org.apache.spark.sql.sources.DataSourceRegister with the following entry (there are others, too):
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
That's what ties the jdbc alias to the data source.
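Since that registration goes through Java's standard ServiceLoader mechanism, you can enumerate the registered providers yourself and see which aliases a given classpath (e.g. your uber-jar) actually exposes. A minimal sketch; the ListDataSources object name is just illustrative, and it assumes the uber-jar is on the classpath when you run it:

import java.util.ServiceLoader

import scala.collection.JavaConverters._

import org.apache.spark.sql.sources.DataSourceRegister

// Hypothetical helper: list every data source alias visible to the ServiceLoader
// lookup, i.e. what Spark SQL can resolve by short name on this classpath.
object ListDataSources {
  def main(args: Array[String]): Unit = {
    val providers = ServiceLoader.load(classOf[DataSourceRegister]).asScala
    providers.foreach { p =>
      println(s"${p.shortName()} -> ${p.getClass.getName}")
    }
  }
}

If "jdbc" does not show up in that listing when run against the uber-jar, the alias cannot be resolved.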
And you have excluded it from the uber-jar with the following assemblyMergeStrategy:
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
Note the case PathList("META-INF", xs @ _*) where you simply MergeStrategy.discard everything under META-INF. That's the root cause.
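You can also confirm this directly on the assembled artifact: with that strategy the services entry is simply absent from the jar. A small sketch; the jar path below is a placeholder, so point it at wherever sbt-assembly writes your uber-jar:

import java.util.jar.JarFile

import scala.io.Source

// Hypothetical check on the assembled jar (placeholder path).
object CheckDataSourceRegister {
  def main(args: Array[String]): Unit = {
    val jar = new JarFile("target/scala-2.11/my-app.jar")
    val entry = jar.getEntry("META-INF/services/org.apache.spark.sql.sources.DataSourceRegister")
    if (entry == null)
      println("No DataSourceRegister entry -- short names such as 'jdbc' cannot be resolved")
    else
      Source.fromInputStream(jar.getInputStream(entry)).getLines().foreach(println)
    jar.close()
  }
}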
Just to check that the "infrastructure" is in place and that you could use the jdbc data source by its fully-qualified name (not the alias), try this:
spark.read.
  format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider").
  load("jdbc:postgresql://localhost/testdb")
You will see other problems due to missing options such as url, but... I digress.
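For completeness, a working variant of that fully-qualified call would pass the connection details as options rather than as the load path. A sketch reusing the values from the question:

// Sketch only -- same connection details as in the question above.
val dfPg = spark.read
  .format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider")
  .option("url", "jdbc:postgresql://localhost/testdb")
  .option("dbtable", "test_tbl")
  .option("user", "postgres")
  .option("password", "postgres")
  .option("driver", "org.postgresql.Driver")
  .load()
dfPg.show()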
One solution is to MergeStrategy.concat all META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files (which would create an uber-jar with all data sources, including the jdbc data source):
case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat