Spark job crashes intermittently with FileNotFoundException during shuffle

I have several Spark jobs, both batch and streaming, that process and analyze system logs. We use Kafka as the pipeline connecting the jobs.
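For context, each streaming job consumes from Kafka roughly like this. This is only a minimal sketch of the Spark 2.1.0 / spark-streaming-kafka-0-10 wiring, not my actual code; the broker address, topic name, and group id are hypothetical:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object LogStreamJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("log-stream-job")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical Kafka settings; replace with your own brokers/topics.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka:9092",
      "key.deserializer"  -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"          -> "syslog-analyzer",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("syslog"), kafkaParams))

    // Any key-based transformation triggers a shuffle, which writes the
    // blockmgr-*/shuffle_* files named in the exception below.
    stream.map(r => (r.key, 1L))
          .reduceByKey(_ + _)
          .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```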

After upgrading to Spark 2.1.0 + Spark Kafka Streaming 010, I found that some jobs (batch or streaming) throw the exception below at random, sometimes after running for several hours and sometimes within just 20 minutes. Can anyone suggest how to find the real root cause? (Many posts seem to discuss this issue, but the solutions don't seem very useful to me...)

Is this caused by a Spark configuration problem or a bug in my code? I can't paste all of my job code here because there is too much of it. Here is the stack trace:

00:30:04,510 WARN - 17/07/22 00:30:04 WARN TaskSetManager: Lost task 60.0 in stage 1518490.0 (TID 338070, 10.133.96.21, executor 0): java.io.FileNotFoundException: /mnt/mesos/work_dir/slaves/20160924-021501-274760970-5050-7646-S2/frameworks/40aeb8e5-e82a-4df9-b034-8815a7a7564b-2543/executors/0/runs/fd15c15d-2511-4f37-a106-27431f583153/blockmgr-a0e0e673-f88b-4d12-a802-c35643e6c6b2/33/shuffle_2090_60_0.index.b66235be-79be-4455-9759-1c7ba70f91f6 (No such file or directory)
00:30:04,510 WARN -     at java.io.FileOutputStream.open0(Native Method)
00:30:04,510 WARN -     at java.io.FileOutputStream.open(FileOutputStream.java:270)
00:30:04,510 WARN -     at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
00:30:04,510 WARN -     at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
00:30:04,510 WARN -     at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:144)
00:30:04,510 WARN -     at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:128)
00:30:04,510 WARN -     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
00:30:04,510 WARN -     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
00:30:04,510 WARN -     at org.apache.spark.scheduler.Task.run(Task.scala:99)
00:30:04,510 WARN -     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
00:30:04,510 WARN -     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
00:30:04,510 WARN -     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
00:30:04,510 WARN -     at java.lang.Thread.run(Thread.java:748)

I finally found the root cause. There was nothing wrong with the Spark jobs at all. We had a crontab that mistakenly cleaned up the temporary storage under /mnt and deleted the Spark cache files.
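In case it helps others: the files the tasks could not find were Spark's shuffle files, which live under the scratch directories Spark creates on each worker (the blockmgr-* path in the trace above). If you cannot change the cleanup job, one possible mitigation is to move that scratch space somewhere the cleaner does not touch, via spark.local.dir. A minimal sketch, assuming a hypothetical /data/spark-scratch path:

```scala
import org.apache.spark.SparkConf

// Sketch only: "/data/spark-scratch" is a hypothetical path outside /mnt.
// spark.local.dir controls where Spark writes shuffle and spill files
// (the blockmgr-* directories). Note that on Mesos/YARN/Standalone this
// setting is overridden by the cluster manager's environment (e.g.
// MESOS_SANDBOX, LOCAL_DIRS), so fixing the cleanup cron itself is the
// more reliable solution.
val conf = new SparkConf()
  .setAppName("log-batch-job")
  .set("spark.local.dir", "/data/spark-scratch")
```

Since our cluster runs on Mesos (the sandbox path in the trace), the setting above would not have taken effect anyway; correcting the crontab so it no longer sweeps /mnt was the actual fix.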