SparkException: Python worker did not connect back in time
I'm trying to submit a Python job to a Spark cluster with 2 worker nodes, but I keep seeing the following problem, which eventually causes spark-submit to fail:
15/07/04 21:30:40 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 (TID
2, workernode0.rhom-spark.b9.internal.cloudapp.net):
org.apache.spark.SparkException: Python worker did not connect back in time
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:135)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:64)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:278)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:305)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:278)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: Accept timed out
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:135)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:199)
at java.net.ServerSocket.implAccept(ServerSocket.java:530)
at java.net.ServerSocket.accept(ServerSocket.java:498)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:130)
... 15 more
I'm submitting the job with:
spark-submit --master yarn --py-files tile.py --num-executors 1 --executor-memory 2g main.py
Any ideas?
So this happens when the Python worker process fails to connect back to the Spark executor JVM. Spark talks to its Python workers over local sockets: the executor opens a server socket, launches the worker, and waits for it to connect back; when that accept times out, you get exactly the SocketTimeoutException shown in your trace. There are many possible causes, and the exact details are most likely in the logs on the executor/worker machines.
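For intuition, here is a minimal, illustrative sketch in Python of the handshake the executor performs (all names here are made up for illustration; the real logic lives in PythonWorkerFactory.createSimpleWorker, in Scala, and the exact timeout value depends on the Spark version):

    # Illustrative sketch only -- not Spark's actual code.
    import socket
    import subprocess
    import sys

    ACCEPT_TIMEOUT_SECONDS = 10  # Spark 1.x used a comparable hard-coded accept timeout

    # 1. The executor JVM opens an ephemeral server socket on localhost.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    server.settimeout(ACCEPT_TIMEOUT_SECONDS)

    # 2. It launches the Python worker, telling it which port to call back on.
    #    ("worker.py" is a hypothetical stand-in for Spark's worker script.)
    worker = subprocess.Popen([sys.executable, "worker.py", str(port)])

    # 3. It blocks in accept(). If the worker never connects back in time
    #    (wrong/missing Python interpreter, overloaded node, ...), accept()
    #    raises socket.timeout -- the Python analogue of the
    #    java.net.SocketTimeoutException in the stack trace above.
    try:
        conn, _ = server.accept()
        print("worker connected back from", conn.getpeername())
    except socket.timeout:
        print("Python worker did not connect back in time")
    finally:
        server.close()

Since you are running on YARN, the worker-side details are usually easiest to pull with yarn logs -applicationId <your application id>. Common culprits to look for are a missing or mismatched Python interpreter on the worker nodes (check what PYSPARK_PYTHON resolves to there) and heavily loaded nodes where the worker simply starts too slowly.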