在将作业提交给执行者之前引发长时间延迟

spark long delay before submitting jobs to the executors

我正在通过 spark-sql 使用 spark-Cassandra 驱动程序来查询我的 Cassandra 集群。每个 Cassandra 节点都有一个 spark worker(位于同一地点)。

问题:在将任务提交给执行程序之前有很长的延迟(基于网络 UI 和驱动程序日志上的时间戳)。查询是一个简单的 select,它指定所有 cassandra 分区键并包含两个阶段和两个任务。以前,查询在另一台具有并置驱动程序和主服务器的服务器上花费了 300 毫秒。

但我必须将我的应用程序和 spark master 移动到另一台服务器(与以前相同,但只是在另一台物理服务器上),现在查询需要 40 秒。虽然任务持续时间大约是 7 秒,但 Job 花了 40 秒,我不明白额外的延迟是什么?

我还用与 Cassandra 没有任何联系的工作检查了 spark,它花了 200 毫秒,所以我认为它与 spark-cassandra 的关系比与 spark 本身的关系更大。

这是作业执行期间的火花日志:

[INFO ] 2019-03-04 06:59:07.067 [qtp1151421920-470] SparkSqlParser 54 - Parsing command: select * from ...
[INFO ] 2019-03-04 06:59:07.276 [qtp1151421920-470] CassandraSourceRelation 35 - Input Predicates: ...
[INFO ] 2019-03-04 06:59:07.279 [qtp1151421920-470] ClockFactory 52 - Using native clock to generate timestamps.
[INFO ] 2019-03-04 06:59:07.439 [qtp1151421920-470] Cluster 1543 - New Cassandra host /192.168.1.201:9042 added
[INFO ] 2019-03-04 06:59:07.440 [qtp1151421920-470] Cluster 1543 - New Cassandra host /192.168.1.202:9042 added
[INFO ] 2019-03-04 06:59:07.440 [qtp1151421920-470] Cluster 1543 - New Cassandra host /192.168.1.203:9042 added
[INFO ] 2019-03-04 06:59:07.440 [qtp1151421920-470] Cluster 1543 - New Cassandra host /192.168.1.204:9042 added
[INFO ] 2019-03-04 06:59:07.446 [qtp1151421920-470] CassandraConnector 35 - Connected to Cassandra cluster: Digger Cluster
[INFO ] 2019-03-04 06:59:07.526 [qtp1151421920-470] CassandraSourceRelation 35 - Input Predicates: ...
[INFO ] 2019-03-04 06:59:07.848 [qtp1151421920-470] CodeGenerator 54 - Code generated in 120.31952 ms
[INFO ] 2019-03-04 06:59:08.264 [qtp1151421920-470] CodeGenerator 54 - Code generated in 15.084165 ms
[INFO ] 2019-03-04 06:59:08.289 [qtp1151421920-470] CodeGenerator 54 - Code generated in 17.893182 ms
[INFO ] 2019-03-04 06:59:08.379 [qtp1151421920-470] SparkContext 54 - Starting job: collectAsList at MyClass.java:5
[INFO ] 2019-03-04 06:59:08.394 [dag-scheduler-event-loop] DAGScheduler 54 - Registering RDD 12 (toJSON at MyClass.java.java:5)
[INFO ] 2019-03-04 06:59:08.397 [dag-scheduler-event-loop] DAGScheduler 54 - Got job 0 (collectAsList at MyClass.java.java:5) with 1 output partitions
[INFO ] 2019-03-04 06:59:08.398 [dag-scheduler-event-loop] DAGScheduler 54 - Final stage: ResultStage 1 (collectAsList at MyClass.java.java:5)
[INFO ] 2019-03-04 06:59:08.398 [dag-scheduler-event-loop] DAGScheduler 54 - Parents of final stage: List(ShuffleMapStage 0)
[INFO ] 2019-03-04 06:59:08.400 [dag-scheduler-event-loop] DAGScheduler 54 - Missing parents: List(ShuffleMapStage 0)
[INFO ] 2019-03-04 06:59:08.405 [dag-scheduler-event-loop] DAGScheduler 54 - Submitting ShuffleMapStage 0 (MapPartitionsRDD[12] at toJSON at MyClass.java.java:5), which has no missing parents
[INFO ] 2019-03-04 06:59:15.703 [pool-44-thread-1] CassandraConnector 35 - Disconnected from Cassandra cluster: Digger Cluster
-----------------long delay here
[INFO ] 2019-03-04 06:59:43.547 [dag-scheduler-event-loop] MemoryStore 54 - Block broadcast_0 stored as values in memory (estimated size 20.6 KB, free 17.8 GB)
[INFO ] 2019-03-04 06:59:43.579 [dag-scheduler-event-loop] MemoryStore 54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 9.5 KB, free 17.8 GB)
[INFO ] 2019-03-04 06:59:43.581 [dispatcher-event-loop-1] BlockManagerInfo 54 - Added broadcast_0_piece0 in memory on 192.168.1.94:38311 (size: 9.5 KB, free: 17.8 GB)
[INFO ] 2019-03-04 06:59:43.584 [dag-scheduler-event-loop] SparkContext 54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1006
[INFO ] 2019-03-04 06:59:43.597 [dag-scheduler-event-loop] DAGScheduler 54 - Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[12] at toJSON at MyClass.java.java:5) (first 15 tasks are for partitions Vector(0))
[INFO ] 2019-03-04 06:59:43.598 [dag-scheduler-event-loop] TaskSchedulerImpl 54 - Adding task set 0.0 with 1 tasks
[INFO ] 2019-03-04 06:59:43.619 [dag-scheduler-event-loop] FairSchedulableBuilder 54 - Added task set TaskSet_0.0 tasks to pool rest
[INFO ] 2019-03-04 06:59:43.652 [dispatcher-event-loop-35] TaskSetManager 54 - Starting task 0.0 in stage 0.0 (TID 0, 192.168.1.210, executor 11, partition 0, NODE_LOCAL, 6357 bytes)
[INFO ] 2019-03-04 06:59:43.920 [dispatcher-event-loop-36] BlockManagerInfo 54 - Added broadcast_0_piece0 in memory on 192.168.1.210:42612 (size: 9.5 KB, free: 912.3 MB)
[INFO ] 2019-03-04 06:59:46.591 [task-result-getter-0] TaskSetManager 54 - Finished task 0.0 in stage 0.0 (TID 0) in 2963 ms on 192.168.1.210 (executor 11) (1/1)
[INFO ] 2019-03-04 06:59:46.594 [task-result-getter-0] TaskSchedulerImpl 54 - Removed TaskSet 0.0, whose tasks have all completed, from pool rest
[INFO ] 2019-03-04 06:59:46.601 [dag-scheduler-event-loop] DAGScheduler 54 - ShuffleMapStage 0 (toJSON at MyClass.java.java:5) finished in 2.981 s
[INFO ] 2019-03-04 06:59:46.602 [dag-scheduler-event-loop] DAGScheduler 54 - looking for newly runnable stages
[INFO ] 2019-03-04 06:59:46.603 [dag-scheduler-event-loop] DAGScheduler 54 - running: Set()
[INFO ] 2019-03-04 06:59:46.603 [dag-scheduler-event-loop] DAGScheduler 54 - waiting: Set(ResultStage 1)
[INFO ] 2019-03-04 06:59:46.604 [dag-scheduler-event-loop] DAGScheduler 54 - failed: Set()
[INFO ] 2019-03-04 06:59:46.608 [dag-scheduler-event-loop] DAGScheduler 54 - Submitting ResultStage 1 (MapPartitionsRDD[18] at collectAsList at MyClass.java.java:5), which has no missing parents
[INFO ] 2019-03-04 06:59:46.615 [dag-scheduler-event-loop] MemoryStore 54 - Block broadcast_1 stored as values in memory (estimated size 20.8 KB, free 17.8 GB)
[INFO ] 2019-03-04 06:59:46.618 [dag-scheduler-event-loop] MemoryStore 54 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 9.8 KB, free 17.8 GB)
[INFO ] 2019-03-04 06:59:46.619 [dispatcher-event-loop-21] BlockManagerInfo 54 - Added broadcast_1_piece0 in memory on 192.168.1.94:38311 (size: 9.8 KB, free: 17.8 GB)
[INFO ] 2019-03-04 06:59:46.620 [dag-scheduler-event-loop] SparkContext 54 - Created broadcast 1 from broadcast at DAGScheduler.scala:1006
[INFO ] 2019-03-04 06:59:46.622 [dag-scheduler-event-loop] DAGScheduler 54 - Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[18] at collectAsList at MyClass.java.java:5) (first 15 tasks are for partitions Vector(0))
[INFO ] 2019-03-04 06:59:46.622 [dag-scheduler-event-loop] TaskSchedulerImpl 54 - Adding task set 1.0 with 1 tasks
[INFO ] 2019-03-04 06:59:46.622 [dag-scheduler-event-loop] FairSchedulableBuilder 54 - Added task set TaskSet_1.0 tasks to pool rest
[INFO ] 2019-03-04 06:59:46.627 [dispatcher-event-loop-25] TaskSetManager 54 - Starting task 0.0 in stage 1.0 (TID 1, 192.168.1.212, executor 9, partition 0, PROCESS_LOCAL, 4730 bytes)
[INFO ] 2019-03-04 06:59:46.851 [dispatcher-event-loop-9] BlockManagerInfo 54 - Added broadcast_1_piece0 in memory on 192.168.1.212:43471 (size: 9.8 KB, free: 912.3 MB)
[INFO ] 2019-03-04 06:59:47.257 [dispatcher-event-loop-38] MapOutputTrackerMasterEndpoint 54 - Asked to send map output locations for shuffle 0 to 192.168.1.212:46794
[INFO ] 2019-03-04 06:59:47.262 [map-output-dispatcher-0] MapOutputTrackerMaster 54 - Size of output statuses for shuffle 0 is 141 bytes
[INFO ] 2019-03-04 06:59:47.763 [task-result-getter-1] TaskSetManager 54 - Finished task 0.0 in stage 1.0 (TID 1) in 1140 ms on 192.168.1.212 (executor 9) (1/1)
[INFO ] 2019-03-04 06:59:47.763 [task-result-getter-1] TaskSchedulerImpl 54 - Removed TaskSet 1.0, whose tasks have all completed, from pool rest
[INFO ] 2019-03-04 06:59:47.765 [dag-scheduler-event-loop] DAGScheduler 54 - ResultStage 1 (collectAsList at MyClass.java.java:5) finished in 1.142 s
[INFO ] 2019-03-04 06:59:47.771 [qtp1151421920-470] DAGScheduler 54 - Job 0 finished: collectAsList at MyClass.java.java:5, took 39.391066 s

[INFO ] 2019-03-04 07:00:09.014 [Spark Context Cleaner] ContextCleaner 54 - Cleaned accumulator 4
[INFO ] 2019-03-04 07:00:09.015 [Spark Context Cleaner] ContextCleaner 54 - Cleaned accumulator 0
[INFO ] 2019-03-04 07:00:09.015 [Spark Context Cleaner] ContextCleaner 54 - Cleaned accumulator 3
[INFO ] 2019-03-04 07:00:09.015 [Spark Context Cleaner] ContextCleaner 54 - Cleaned accumulator 1
[INFO ] 2019-03-04 07:00:09.028 [dispatcher-event-loop-10] BlockManagerInfo 54 - Removed broadcast_1_piece0 on 192.168.1.94:38311 in memory (size: 9.8 KB, free: 17.8 GB)
[INFO ] 2019-03-04 07:00:09.045 [dispatcher-event-loop-0] BlockManagerInfo 54 - Removed broadcast_1_piece0 on 192.168.1.212:43471 in memory (size: 9.8 KB, free: 912.3 MB)
[INFO ] 2019-03-04 07:00:09.063 [Spark Context Cleaner] ContextCleaner 54 - Cleaned shuffle 0
[INFO ] 2019-03-04 07:00:09.065 [dispatcher-event-loop-16] BlockManagerInfo 54 - Removed broadcast_0_piece0 on 192.168.1.94:38311 in memory (size: 9.5 KB, free: 17.8 GB)
[INFO ] 2019-03-04 07:00:09.071 [dispatcher-event-loop-37] BlockManagerInfo 54 - Removed broadcast_0_piece0 on 192.168.1.210:42612 in memory (size: 9.5 KB, free: 912.3 MB)
[INFO ] 2019-03-04 07:00:09.074 [Spark Context Cleaner] ContextCleaner 54 - Cleaned accumulator 2

还附上作业的 spark web ui 的屏幕截图,其 tasks.Logs 和图像不是同一作业的。

P.S:spark-cassandra 连接器是否在我每次 运行 查询时创建一个新会话(我每次都看到连接-断开到 cassandra 集群)?我 运行 多个并行查询,这不会比 pure-cassandra 慢很多吗?

spark job

检查 jvisualvm,Executors 在时间间隔内没有 activity,但驱动程序(我的应用程序)只有在时间间隔内有一个名为 "dag-scheduler..." 运行 的线程。线程转储说它卡在 InetAddress.getHostName() 上。

然后在调试模式下,我在那里放置了一个断点,发现它正在尝试为我的所有 cassandra 集群反向查找(ip 到主机名),所以刚刚添加了所有 [=16=我的 cassandra 集群的 ]s 到 /etc/hosts 的末尾,问题解决了!