How to make use of Delta Lake in a regular Scala project in an IDE

I have added the Delta dependency in build.sbt:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion,
  // logging
  "org.apache.logging.log4j" % "log4j-api" % "2.4.1",
  "org.apache.logging.log4j" % "log4j-core" % "2.4.1",
  // postgres for DB connectivity
  "org.postgresql" % "postgresql" % postgresVersion,
  "io.delta" %% "delta-core" % "0.7.0"

However, I don't know what configuration the Spark session must include. The code below fails:

val spark = SparkSession.builder()
    .appName("Spark SQL Practice")
    .config("spark.master", "local")
    .config("spark.network.timeout"  , "10000000s")//to avoid Heartbeat exception
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()

The exception:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/plans/logical/MergeIntoTable

You need to upgrade Apache Spark. The MergeIntoTable feature was introduced in v3.0.0. Sources: AstBuilder.scala, Analyzer.scala, the Github Pull Request, and the Release Notes (see the feature enhancements section).
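For context, Delta Lake 0.7.0's MERGE support is what references that logical plan, so the class must exist in the Spark JARs at runtime. A rough sketch of a merge on Spark 3.0.0 with delta-core 0.7.0 follows; the table paths and the `id` join column are made-up examples, not taken from the question:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

// Hypothetical paths and schema, for illustration only.
val spark = SparkSession.builder()
  .appName("delta-merge-sketch")
  .master("local[*]")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

val updates = spark.read.format("delta").load("/tmp/delta/updates")

// MERGE INTO via the DeltaTable API; this is the code path that
// needs MergeIntoTable, which only exists in Spark 3.0.0+.
DeltaTable.forPath(spark, "/tmp/delta/events")
  .as("target")
  .merge(updates.as("source"), "target.id = source.id")
  .whenMatched.updateAll()
  .whenNotMatched.insertAll()
  .execute()
```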

Here is an example project I made that should help you.

The build.sbt file should contain the following dependencies:

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0" % "provided"
libraryDependencies += "io.delta" %% "delta-core" % "0.7.0" % "provided"

I think you need to use Spark 3 for Delta Lake 0.7.0.

You don't need any special SparkSession configuration options; something like this should be fine:

lazy val spark: SparkSession = {
  SparkSession
    .builder()
    .master("local")
    .appName("spark session")
    .config("spark.databricks.delta.retentionDurationCheck.enabled", "false")
    .getOrCreate()
}
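With that session (and delta-core 0.7.0 plus Spark 3.0.0 on the classpath), a quick smoke test is to round-trip a small DataFrame through the delta format; the /tmp path below is just an example location:

```scala
import spark.implicits._

// Write a tiny DataFrame as a Delta table, then read it back.
val data = Seq((1, "a"), (2, "b")).toDF("id", "value")
data.write.format("delta").mode("overwrite").save("/tmp/delta-smoke-test")

val roundTrip = spark.read.format("delta").load("/tmp/delta-smoke-test")
roundTrip.show()
```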

This happens when a class file your code depends on existed at compile time but cannot be found at runtime. Look for differences between your build-time and runtime classpaths.

More specific to your scenario:

If you get a java.lang.NoClassDefFoundError on
org/apache/spark/sql/catalyst/plans/logical/MergeIntoTable, the Spark JAR
version on your classpath does not contain the MergeIntoTable class.
The solution is to move to the latest Apache Spark version, which ships with
the org/apache/spark/sql/catalyst/plans/logical/MergeIntoTable.scala file.
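One way to confirm whether the class is actually on your runtime classpath, and which JAR it comes from, is a reflective lookup like the sketch below (the object name `ClasspathCheck` is mine, not from the question):

```scala
// Diagnose NoClassDefFoundError: look up the class reflectively at runtime.
// Prints the JAR it was loaded from, or reports that it is absent.
object ClasspathCheck extends App {
  val className = "org.apache.spark.sql.catalyst.plans.logical.MergeIntoTable"
  try {
    val cls = Class.forName(className)
    val location = cls.getProtectionDomain.getCodeSource.getLocation
    println(s"$className loaded from $location")
  } catch {
    case _: ClassNotFoundException =>
      println(s"$className is NOT on the runtime classpath")
  }
}
```

If this prints a pre-3.0.0 spark-catalyst JAR or "NOT on the runtime classpath", the fix is the Spark upgrade described above.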

More info in the Spark 3.x.x upgrade pull request: https://github.com/apache/spark/pull/26167