Apache Spark YARN mode startup takes too long (10+ secs)
I am running a Spark application in YARN-client or YARN-cluster mode.
But it seems to take too long to start up.
It takes 10+ seconds to initialize the Spark context.
Is this normal? Or can it be optimized?
The environment is as follows:
- Hadoop: Hortonworks HDP 2.2 (Hadoop 2.6) (a tiny test cluster with 3 data nodes)
- Spark: 1.3.1
- Client: Windows 7, but similar results on CentOS 6.6
The following is the startup part of the application log. (Some private information has been redacted.)
"Main: Initializing context" on the first line and "MainProcessor: Deleting previous output files" on the last line are the application's own log messages. Everything in between comes from Spark itself. The application logic is executed after this log is printed.
15/05/07 09:18:31 INFO Main: Initializing context
15/05/07 09:18:31 INFO SparkContext: Running Spark version 1.3.1
15/05/07 09:18:31 INFO SecurityManager: Changing view acls to: myuser,myapp
15/05/07 09:18:31 INFO SecurityManager: Changing modify acls to: myuser,myapp
15/05/07 09:18:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
15/05/07 09:18:31 INFO Slf4jLogger: Slf4jLogger started
15/05/07 09:18:31 INFO Remoting: Starting remoting
15/05/07 09:18:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@mymachine:54449]
15/05/07 09:18:31 INFO Utils: Successfully started service 'sparkDriver' on port 54449.
15/05/07 09:18:31 INFO SparkEnv: Registering MapOutputTracker
15/05/07 09:18:32 INFO SparkEnv: Registering BlockManagerMaster
15/05/07 09:18:32 INFO DiskBlockManager: Created local directory at C:\Users\myuser\AppData\Local\Temp\spark-2d3db9d6-ea78-438e-956f-be9c1dcf3a9d\blockmgr-e9ade223-a4b8-4d9f-b038-efd66adf9772
15/05/07 09:18:32 INFO MemoryStore: MemoryStore started with capacity 1956.7 MB
15/05/07 09:18:32 INFO HttpFileServer: HTTP File server directory is C:\Users\myuser\AppData\Local\Temp\spark-ff40d73b-e8ab-433e-88c4-35da27fb6278\httpd-def9220f-ac3a-4dd2-9ac1-2c593b94b2d9
15/05/07 09:18:32 INFO HttpServer: Starting HTTP Server
15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/07 09:18:32 INFO AbstractConnector: Started SocketConnector@0.0.0.0:54450
15/05/07 09:18:32 INFO Utils: Successfully started service 'HTTP file server' on port 54450.
15/05/07 09:18:32 INFO SparkEnv: Registering OutputCommitCoordinator
15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/07 09:18:32 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/05/07 09:18:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/05/07 09:18:32 INFO SparkUI: Started SparkUI at http://mymachine:4040
15/05/07 09:18:32 INFO SparkContext: Added JAR file:/D:/Projects/MyApp/MyApp.jar at http://10.111.111.199:54450/jars/MyApp.jar with timestamp 1430957912240
15/05/07 09:18:32 INFO RMProxy: Connecting to ResourceManager at cluster01/10.111.111.11:8050
15/05/07 09:18:32 INFO Client: Requesting a new application from cluster with 3 NodeManagers
15/05/07 09:18:32 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (23040 MB per container)
15/05/07 09:18:32 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/05/07 09:18:32 INFO Client: Setting up container launch context for our AM
15/05/07 09:18:32 INFO Client: Preparing resources for our AM container
15/05/07 09:18:32 INFO Client: Source and destination file systems are the same. Not copying hdfs://cluster01/apps/spark/spark-assembly-1.3.1-hadoop2.6.0.jar
15/05/07 09:18:32 INFO Client: Setting up the launch environment for our AM container
15/05/07 09:18:33 INFO SecurityManager: Changing view acls to: myuser,myapp
15/05/07 09:18:33 INFO SecurityManager: Changing modify acls to: myuser,myapp
15/05/07 09:18:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
15/05/07 09:18:33 INFO Client: Submitting application 2 to ResourceManager
15/05/07 09:18:33 INFO YarnClientImpl: Submitted application application_1430956687773_0002
15/05/07 09:18:34 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:34 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1430957906540
final status: UNDEFINED
tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
user: myapp
15/05/07 09:18:35 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:36 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:37 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:37 INFO YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM@cluster02:39698/user/YarnAM#-1579648782]
15/05/07 09:18:37 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> cluster01, PROXY_URI_BASES -> http://cluster01:8088/proxy/application_1430956687773_0002), /proxy/application_1430956687773_0002
15/05/07 09:18:37 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/05/07 09:18:38 INFO Client: Application report for application_1430956687773_0002 (state: RUNNING)
15/05/07 09:18:38 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: cluster02
ApplicationMaster RPC port: 0
queue: default
start time: 1430957906540
final status: UNDEFINED
tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
user: myapp
15/05/07 09:18:38 INFO YarnClientSchedulerBackend: Application application_1430956687773_0002 has started running.
15/05/07 09:18:38 INFO NettyBlockTransferService: Server created on 54491
15/05/07 09:18:38 INFO BlockManagerMaster: Trying to register BlockManager
15/05/07 09:18:38 INFO BlockManagerMasterActor: Registering block manager mymachine:54491 with 1956.7 MB RAM, BlockManagerId(<driver>, mymachine, 54491)
15/05/07 09:18:38 INFO BlockManagerMaster: Registered BlockManager
15/05/07 09:18:43 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster02:44996/user/Executor#-786778979] with ID 1
15/05/07 09:18:43 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
15/05/07 09:18:43 INFO MainProcessor: Deleting previous output files
Thanks.
UPDATE
I think I've found the (maybe partial, but major) cause.
It's between the following lines:
15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster04:55237/user/Executor#-149550753] with ID 1
I looked at the logs on the cluster side and found the following lines: (the exact times differ from the lines above because they come from a different machine)
15/05/08 11:36:23 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000
15/05/08 11:36:28 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454
It looks like Spark deliberately sleeps for 5 seconds.
I've read the Spark source code. In org.apache.spark.deploy.yarn.ApplicationMaster.scala, launchReporterThread() has the code for this.
It loops, calling allocator.allocateResources() and Thread.sleep().
For the sleep, it reads the configuration variable spark.yarn.scheduler.heartbeat.interval-ms (the default value is 5000, i.e. 5 seconds).
According to the comment, "we want to be reasonably responsive without causing too many requests to RM".
So, unless YARN fulfills the allocation request immediately, it seems that 5 seconds will be wasted.
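A simplified sketch of the relevant logic (paraphrased from memory, not the exact Spark 1.3.1 source; names and structure are approximate):

// Paraphrased from org.apache.spark.deploy.yarn.ApplicationMaster (Spark 1.3.x).
// Simplified for illustration, not the exact source.
private def launchReporterThread(): Thread = {
  // Default is 5000 ms, i.e. the 5-second wait observed above.
  val interval = sparkConf.getLong("spark.yarn.scheduler.heartbeat.interval-ms", 5000)
  val t = new Thread {
    override def run(): Unit = {
      while (!finished) {
        allocator.allocateResources()  // ask the RM for (more) containers
        Thread.sleep(interval)         // then sleep a full heartbeat interval
      }
    }
  }
  t.setDaemon(true)
  t.setName("Reporter")
  t.start()
  t
}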
When I changed the configuration variable to 1000, it only waited 1 second.
Here are the log lines after the change:
15/05/08 11:47:21 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 1000
15/05/08 11:47:22 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454
That saved 4 seconds.
So, when one doesn't want to wait the full 5 seconds, one can change spark.yarn.scheduler.heartbeat.interval-ms.
I hope the additional overhead it incurs is negligible.
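For example, it can be passed on the command line like this (the main class and jar name here are just placeholders; the backslash line continuations assume a Unix shell):

spark-submit --master yarn-client \
    --conf spark.yarn.scheduler.heartbeat.interval-ms=1000 \
    --class Main MyApp.jar

Setting it on the SparkConf before the context is created should also work, since Spark propagates its configuration to the ApplicationMaster.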
UPDATE
A related JIRA issue has been opened and resolved. See https://issues.apache.org/jira/browse/SPARK-7533
That's typical. My system takes about 20 seconds from running spark-submit to getting a SparkContext.
As it says in several places in the docs, the solution is to turn your driver into an RPC server. That way you initialize it once, and then other applications can use the driver's context as a service.
I am in the middle of doing this with my application. I am using http4s and turning my driver into a web server.
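A minimal sketch of the idea, using the JDK's built-in HTTP server rather than http4s just to keep the example self-contained (the port, the endpoint, and the job inside it are placeholders, not from my actual app):

import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.{SparkConf, SparkContext}

object DriverServer {
  def main(args: Array[String]): Unit = {
    // Pay the context-creation cost once, when the server starts.
    val sc = new SparkContext(new SparkConf().setAppName("long-lived-driver"))

    val server = HttpServer.create(new InetSocketAddress(8090), 0)
    server.createContext("/count", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        // Placeholder job: every request reuses the same SparkContext.
        val result = sc.parallelize(1 to 1000000).count().toString
        val body = result.getBytes("UTF-8")
        exchange.sendResponseHeaders(200, body.length.toLong)
        exchange.getResponseBody.write(body)
        exchange.close()
      }
    })
    server.start()
  }
}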
If you are using macOS to run some tasks in standalone mode,
just remember to enable the remote ssh connection in your System Preferences -> Sharing
(I don't know why we have to do that).
Before enabling it, it took me about 1 minute to run spark-submit xx.py;
after enabling it, it only took 3 seconds.
I hope it helps others who hit this kind of problem on the macOS platform.
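(As a side note, I believe the same Remote Login setting can also be toggled from the terminal with the built-in systemsetup tool, e.g. sudo systemsetup -setremotelogin on, but the Sharing preference pane is what I actually used.)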
You can check out Apache Livy, which is a REST API in front of Spark.
You can have one session and send multiple requests to that one Spark/Livy session.
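For example (a rough sketch, assuming a Livy server on its default port 8998; livy-host is a placeholder), you create the session once and then send statements to it:

curl -X POST -H "Content-Type: application/json" -d '{"kind": "spark"}' http://livy-host:8998/sessions
curl -X POST -H "Content-Type: application/json" -d '{"code": "sc.parallelize(1 to 100).count()"}' http://livy-host:8998/sessions/0/statements

Only the first call pays the context-creation cost; later statements reuse the running session.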
For faster Spark context creation.
Tested on EMR:

cd /usr/lib/spark/jars/; zip /tmp/yarn-archive.zip *.jar
cd path/to/folder/of/someOtherDependancy/jarFolder/; zip /tmp/yarn-archive.zip jar-file.jar
zip -Tv /tmp/yarn-archive.zip   # tests integrity, with verbose debug output

If yarn-archive.zip already exists on hdfs, then:
hdfs dfs -rm -r -f -skipTrash /user/hadoop/yarn-archive.zip
hdfs dfs -put /tmp/yarn-archive.zip /user/hadoop/
otherwise:
hdfs dfs -put /tmp/yarn-archive.zip /user/hadoop/

Then use this parameter with spark-submit:
--conf spark.yarn.archive="hdfs:///user/hadoop/yarn-archive.zip"
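For example, a full submit command might look like this (the class name and application jar are placeholders, not part of the steps above):

spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.archive="hdfs:///user/hadoop/yarn-archive.zip" \
    --class com.example.MyJob my-job.jar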
This works because the master does not have to distribute all the jars to the slaves; they can fetch them from a common hdfs path, here hdfs:///user/hadoop/yarn-archive.zip.
I found that it can save you 3-5 seconds, and the saving also depends on the number of nodes in the cluster. The more nodes, the more time you save.