Are recompiled-from-source classes in Spark jars breaking sbt's merge?

Attempting to create a fat jar with sbt fails with errors like the following:

java.lang.RuntimeException: deduplicate: different file contents found in the following:
C:\Users\db\.ivy2\cache\org.apache.spark\spark-network-common_2.10\jars\spark-network-common_2.10-1.6.3.jar:com/google/common/base/Function.class
C:\Users\db\.ivy2\cache\com.google.guava\guava\bundles\guava-14.0.1.jar:com/google/common/base/Function.class

There are many such classes; this one is just an example. Guava 14.0.1 is the version of Function.class in both jars:

[info]  +-com.google.guava:guava:14.0.1
...
[info]  | | +-com.google.guava:guava:14.0.1

which means sbt/ivy won't pick one of them as the newer version; but the class files in the two jars differ in size and date, which presumably causes the error above:

$ jar tvf /c/Users/db/.ivy2/cache/org.apache.spark/spark-network-common_2.10/jars/spark-network-common_2.10-1.6.3.jar | grep "com/google/common/base/Function.class"
   549 Wed Nov 02 16:03:20 CDT 2016 com/google/common/base/Function.class

$ jar tvf /c/Users/db/.ivy2/cache/com.google.guava/guava/bundles/guava-14.0.1.jar  | grep "com/google/common/base/Function.class"
   543 Thu Mar 14 19:56:52 CDT 2013 com/google/common/base/Function.class

It looks like Apache is recompiling Function.class from source rather than shipping the originally compiled class. Is that a correct understanding of what's happening here? The recompiled classes can be excluded with sbt, but is there a way to build the jar without explicitly excluding, by name, every jar that contains recompiled sources? Excluding jars explicitly leads to something like the snippet below, which makes me think I'm headed down the wrong path:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.3"
  excludeAll(
    ExclusionRule(organization = "com.twitter"),
    ExclusionRule(organization = "org.apache.spark", name = "spark-network-common_2.10"),
    ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-client"),
    ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-hdfs"),
    ExclusionRule(organization = "org.tachyonproject", name = "tachyon-client"),
    ExclusionRule(organization = "commons-beanutils", name = "commons-beanutils"),
    ExclusionRule(organization = "commons-collections", name = "commons-collections"),
    ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-yarn-api"),
    ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-yarn-common"),
    ExclusionRule(organization = "org.apache.curator", name = "curator-recipes")
  ),
libraryDependencies += "org.apache.spark" %% "spark-network-common" % "1.6.3" exclude("com.google.guava", "guava"),
libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.6.3",
libraryDependencies += "com.typesafe.scala-logging" %% "scala-logging-slf4j" % "2.1.2",
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0" exclude("com.google.guava", "guava"),
libraryDependencies += "com.google.guava" % "guava" % "14.0.1",
libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11",
libraryDependencies += "org.json4s" %% "json4s-ext" % "3.2.11",
libraryDependencies += "com.rabbitmq" % "amqp-client" % "4.1.1",
libraryDependencies += "commons-codec" % "commons-codec" % "1.10",

If that's the wrong path, what's a cleaner way?

The cleaner way is not to package spark-core at all. It becomes available when you install Spark on the target machine, and it is available to your application at runtime (you can usually find the jars in /usr/lib/spark/jars).

You should mark these Spark dependencies as % "provided". That should help you avoid many of the conflicts caused by packaging those jars.
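
A minimal sketch of what that could look like in build.sbt, reusing the versions from the question (exactly which Spark modules your build needs to mark this way is an assumption based on the dependency list above):

// Spark is supplied by the cluster's own installation at runtime,
// so compile against it here but keep it out of the fat jar.
libraryDependencies += "org.apache.spark" %% "spark-core"   % "1.6.3" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.6.3" % "provided"

// Application-level dependencies are still bundled into the assembly as before.
libraryDependencies += "com.google.guava" % "guava" % "14.0.1"
libraryDependencies += "com.rabbitmq" % "amqp-client" % "4.1.1"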