Spark Launcher: Can't see the complete stack trace for failed SQL query

I am using SparkLauncher to connect to Spark in cluster mode on top of Yarn. I am running some SQL code written in Scala, like this:

import scala.util.control.NonFatal

def execute(code: String): Unit = {
  try {
    val resultDataframe = spark.sql(code)
    resultDataframe.write.json("s3://some/prefix")
  } catch {
    case NonFatal(f) =>
      log.warn(s"Fail to execute query $code", f)
      log.info(f.getMessage, getNestedStackTrace(f, Seq[String]()))
  }
}

def getNestedStackTrace(e: Throwable, msg: Seq[String]): Seq[String] = {
   if (e.getCause == null) return msg
   getNestedStackTrace(e.getCause, msg ++ e.getStackTrace.map(_.toString))
}

Now, when I run a query that is expected to fail through the execute() method (for example, querying a partitioned table without a partition predicate: select * from partitioned_table_on_dt limit 1;), I get back an incorrect stack trace.

Correct stack trace, when I manually run spark.sql(code).write.json() from spark-shell:

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
+- *(1) LocalLimit 1
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:127)
...

Caused by: org.apache.hadoop.hive.ql.parse.SemanticException: No partition predicate found for partitioned table
 partitioned_table_on_dt.
 If the table is cached in memory then turn off this check by setting
 hive.mapred.mode to nonstrict
    at org.apache.spark.sql.hive.execution.HiveTableScanExec.prunePartitions(HiveTableScanExec.scala:155)
...

org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
...

Incorrect stack trace from the execute() method above:

Job Aborted: 
"org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)",
"org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)",
"org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)",
...

"org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)",
"org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)",
"org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:131)",
"org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:127)",
...

The spark-shell stack trace has three nested exceptions, SparkException(SemanticException(TreeNodeException)), but the trace I see from my code only comes from SparkException and TreeNodeException; even after collecting the nested stack traces with the getNestedStackTrace() method, the most valuable SemanticException trace is missing.

Can any Spark/Scala experts tell me what I am doing wrong here, or how I can get the complete stack trace with all of the exceptions?

The recursive method getNestedStackTrace() has a bug: it stops as soon as e.getCause is null, so it returns before appending the frames of the innermost cause, which in this case is exactly the SemanticException you care about.

def getNestedStackTrace(e: Throwable, msg: Seq[String]): Seq[String] = {
  if (e == null) return msg // check e itself, not e.getCause, so the innermost cause's frames are kept
  getNestedStackTrace(e.getCause, msg ++ e.getStackTrace.map(_.toString))
}
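
With that fix the recursion also picks up the SemanticException frames. As a side note, if all you need is the full chained trace as a single string, "Caused by" sections included, the JDK can render it directly; a minimal sketch, assuming you still log it from the catch block in execute() (fullStackTrace is just an illustrative helper name):

import java.io.{PrintWriter, StringWriter}

// Render a Throwable and all of its causes exactly as printStackTrace would,
// including every "Caused by: ..." section, into one String suitable for logging.
def fullStackTrace(e: Throwable): String = {
  val sw = new StringWriter()
  e.printStackTrace(new PrintWriter(sw))
  sw.toString
}

If commons-lang3 is already on your classpath, org.apache.commons.lang3.exception.ExceptionUtils.getStackTrace(f) should give you the same result without the helper.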