为什么 Spark Streaming 应用程序在 YARN 上从 Kafka 消费时会停止?

Why would Spark Streaming application stall when consuming from Kafka on YARN?

我正在用 Scala 编写 Spark Streaming 应用程序。该应用程序的目标是使用来自 Kafka 的最新记录并将它们打印到标准输出。

当我 运行 在本地使用 --master local[n] 时,该应用程序运行完美。但是,当我 运行 YARN 中的应用程序(并生成我正在使用的主题)时,应用程序卡在了:

16/11/18 20:53:05 INFO JobScheduler: Added jobs for time 1479502385000 ms

多次重复上述行后,Spark 给出以下错误:

16/11/18 20:54:47 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 9, r3d3.hadoop.REDACTED.REDACTED): java.net.ConnectException: Connection timed out
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:454)
at sun.nio.ch.Net.connect(Net.java:446)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
at kafka.consumer.SimpleConsumer.connect(SimpleConsumer.scala:44)
at kafka.consumer.SimpleConsumer.getOrMakeConnection(SimpleConsumer.scala:142)
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:69)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$$anonfun$apply$mcV$sp.apply$mcV$sp(SimpleConsumer.scala:109)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$$anonfun$apply$mcV$sp.apply(SimpleConsumer.scala:109)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$$anonfun$apply$mcV$sp.apply(SimpleConsumer.scala:109)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer$$anonfun$fetch.apply$mcV$sp(SimpleConsumer.scala:108)
at kafka.consumer.SimpleConsumer$$anonfun$fetch.apply(SimpleConsumer.scala:108)
at kafka.consumer.SimpleConsumer$$anonfun$fetch.apply(SimpleConsumer.scala:108)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:150)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:162)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at org.apache.spark.util.NextIterator.to(NextIterator.scala:21)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at org.apache.spark.util.NextIterator.toBuffer(NextIterator.scala:21)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at org.apache.spark.util.NextIterator.toArray(NextIterator.scala:21)
at org.apache.spark.rdd.RDD$$anonfun$collect$$anonfun.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$collect$$anonfun.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

流媒体出错 UI:

org.apache.spark.streaming.dstream.DStream.print(DStream.scala:757)
com.REDACTED.bdp.Main$.main(Main.scala:88)
com.REDACTED.bdp.Main.main(Main.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
org.apache.spark.deploy.SparkSubmit$.doRunMain(SparkSubmit.scala:181)
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

来自 YARN 应用程序日志(标准输出)的错误:

java.lang.NullPointerException
        at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.close(KafkaRDD.scala:158)
        at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:66)
        at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun.apply(KafkaRDD.scala:101)
        at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun.apply(KafkaRDD.scala:101)
        at org.apache.spark.TaskContextImpl$$anon.onTaskCompletion(TaskContextImpl.scala:60)
        at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted.apply(TaskContextImpl.scala:79)
        at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted.apply(TaskContextImpl.scala:77)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77)
        at org.apache.spark.scheduler.Task.run(Task.scala:91)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2016-11-21 15:57:49,925] ERROR Exception in task 0.1 in stage 33.0 (TID 34) (org.apache.spark.executor.Executor)
org.apache.spark.util.TaskCompletionListenerException
        at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:91)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

YARN 应用程序日志中的另一个错误:

[2016-11-21 15:52:32,264] WARN Exception encountered while connecting to the server :  (org.apache.hadoop.ipc.Client)
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
        at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:375)
        at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:558)
        at org.apache.hadoop.ipc.Client$Connection.access00(Client.java:373)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:727)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:723)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:722)
        at org.apache.hadoop.ipc.Client$Connection.access00(Client.java:373)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1493)
        at org.apache.hadoop.ipc.Client.call(Client.java:1397)
        at org.apache.hadoop.ipc.Client.call(Client.java:1358)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
        at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2116)
        at org.apache.hadoop.hdfs.DistributedFileSystem.doCall(DistributedFileSystem.java:1315)
        at org.apache.hadoop.hdfs.DistributedFileSystem.doCall(DistributedFileSystem.java:1311)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1311)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
        at org.apache.spark.deploy.yarn.Client$.org$apache$spark$deploy$yarn$Client$$sparkJar(Client.scala:1195)
        at org.apache.spark.deploy.yarn.Client$.populateClasspath(Client.scala:1333)
        at org.apache.spark.deploy.yarn.ExecutorRunnable.prepareEnvironment(ExecutorRunnable.scala:290)
        at org.apache.spark.deploy.yarn.ExecutorRunnable.env$lzycompute(ExecutorRunnable.scala:61)
        at org.apache.spark.deploy.yarn.ExecutorRunnable.env(ExecutorRunnable.scala:61)
        at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:80)
        at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:68)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

奇怪的是,无论出于何种原因,大约有 5% 的时间,应用程序成功地从 Kafka 读取数据。

集群和 YARN 似乎工作正常。 使用 Kerberos 保护集群。

此错误的来源可能是什么?

tl;dr 答案未提供答案, 建议下一步可能。

我对何时可以为流作业报告 Lost task 事件的理解是作业何时执行但无法完成,在您的情况下是 Spark 执行程序和 Kafka 代理之间的连接问题.

16/11/18 20:54:47 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 9, r3d3.hadoop.REDACTED.REDACTED): java.net.ConnectException: Connection timed out
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:454)
at sun.nio.ch.Net.connect(Net.java:446)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
at kafka.consumer.SimpleConsumer.connect(SimpleConsumer.scala:44)
at kafka.consumer.SimpleConsumer.getOrMakeConnection(SimpleConsumer.scala:142)
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:69)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$$anonfun$apply$mcV$sp.apply$mcV$sp(SimpleConsumer.scala:109)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$$anonfun$apply$mcV$sp.apply(SimpleConsumer.scala:109)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$$anonfun$apply$mcV$sp.apply(SimpleConsumer.scala:109)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer$$anonfun$fetch.apply$mcV$sp(SimpleConsumer.scala:108)
at kafka.consumer.SimpleConsumer$$anonfun$fetch.apply(SimpleConsumer.scala:108)
at kafka.consumer.SimpleConsumer$$anonfun$fetch.apply(SimpleConsumer.scala:108)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:150)

pattern of the error message如下:

Lost task [id] in stage [taskSetId] (TID [tid], [host], executor [executorId]): [reason]

将您的情况转化为在主机 r3d3.hadoop.REDACTED.REDACTED.

上拥有 Spark 执行程序 运行

失败的原因如下:

java.net.ConnectException: Connection timed out
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:454)
at sun.nio.ch.Net.connect(Net.java:446)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
at kafka.consumer.SimpleConsumer.connect(SimpleConsumer.scala:44)
at kafka.consumer.SimpleConsumer.getOrMakeConnection(SimpleConsumer.scala:142)
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:69)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$$anonfun$apply$mcV$sp.apply$mcV$sp(SimpleConsumer.scala:109)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$$anonfun$apply$mcV$sp.apply(SimpleConsumer.scala:109)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$$anonfun$apply$mcV$sp.apply(SimpleConsumer.scala:109)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer$$anonfun$fetch.apply$mcV$sp(SimpleConsumer.scala:108)
at kafka.consumer.SimpleConsumer$$anonfun$fetch.apply(SimpleConsumer.scala:108)
at kafka.consumer.SimpleConsumer$$anonfun$fetch.apply(SimpleConsumer.scala:108)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)

我会问自己什么时候 Kafka 代理对客户端不可用(在您的情况下是 Spark Streaming 应用程序,它可能有助于或可能不会有助于理解问题的根本原因)。

我认为它可能与 Apache Spark 无关,会在 Kafka 圈子中寻找更多答案。