Spark Launcher: Can't see the complete stack trace for failed SQL query

I am using SparkLauncher to connect to Spark in cluster mode on top of Yarn. I am running some SQL code written in Scala, like this:

import scala.util.control.NonFatal

def execute(code: String): Unit = {
  try {
    val resultDataframe = spark.sql(code)
    resultDataframe.write.json("s3://some/prefix")
  } catch {
    case NonFatal(f) =>
      log.warn(s"Fail to execute query $code", f)
      log.info(f.getMessage, getNestedStackTrace(f, Seq[String]()))
  }
}

def getNestedStackTrace(e: Throwable, msg: Seq[String]): Seq[String] = {
   if (e.getCause == null) return msg
   getNestedStackTrace(e.getCause, msg ++ e.getStackTrace.map(_.toString))
}

Now, when I run a query that is expected to fail through the execute() method (for example, querying a partitioned table without a partition predicate: select * from partitioned_table_on_dt limit 1;), I get back an incorrect stack trace.

Correct stack trace, when I manually run spark.sql(code).write.json() from spark-shell:

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
+- *(1) LocalLimit 1
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:127)
...

Caused by: org.apache.hadoop.hive.ql.parse.SemanticException: No partition predicate found for partitioned table
 partitioned_table_on_dt.
 If the table is cached in memory then turn off this check by setting
 hive.mapred.mode to nonstrict
    at org.apache.spark.sql.hive.execution.HiveTableScanExec.prunePartitions(HiveTableScanExec.scala:155)
...

org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
...

Incorrect stack trace from the execute() method above:

Job Aborted: 
"org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)",
"org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)",
"org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)",
...

"org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)",
"org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)",
"org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:131)",
"org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:127)",
...

The spark-shell stack trace has three nested exceptions, SparkException(SemanticException(TreeNodeException)), but the trace I see from my code only comes from SparkException and TreeNodeException; even after collecting the nested stack traces with the getNestedStackTrace() method, the most valuable SemanticException trace is missing.

Can any Spark/Scala experts tell me what I am doing wrong here, or how I can get the complete stack trace with all of the exceptions?

The recursive method getNestedStackTrace() has a bug: it stops as soon as e.getCause is null, so it returns before appending the frames of the innermost cause, which in this case is exactly the SemanticException you care about.

def getNestedStackTrace(e: Throwable, msg: Seq[String]): Seq[String] = {
  if (e == null) return msg // check e itself, not e.getCause, so the innermost cause's frames are kept
  getNestedStackTrace(e.getCause, msg ++ e.getStackTrace.map(_.toString))
}
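
With that fix the recursion also picks up the SemanticException frames. As a side note, if all you need is the full chained trace as a single string, "Caused by" sections included, the JDK can render it directly; a minimal sketch, assuming you still log it from the catch block in execute() (fullStackTrace is just an illustrative helper name):

import java.io.{PrintWriter, StringWriter}

// Render a Throwable and all of its causes exactly as printStackTrace would,
// including every "Caused by: ..." section, into one String suitable for logging.
def fullStackTrace(e: Throwable): String = {
  val sw = new StringWriter()
  e.printStackTrace(new PrintWriter(sw))
  sw.toString
}

If commons-lang3 is already on your classpath, org.apache.commons.lang3.exception.ExceptionUtils.getStackTrace(f) should give you the same result without the helper.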