Apache Spark YARN mode startup takes too long (10+ secs)
I am running a Spark application in YARN-client or YARN-cluster mode.
But it seems to take too long to start up.
It takes 10+ seconds to initialize the Spark context.
Is this normal? Or can it be optimized?
The environment is as follows:
- Hadoop: Hortonworks HDP 2.2 (Hadoop 2.6) (a tiny test cluster with 3 data nodes)
- Spark: 1.3.1
- Client: Windows 7, but similar results on CentOS 6.6
The following is the startup part of the application log. (Some private information has been redacted.)
"Main: Initializing context" on the first line and "MainProcessor: Deleting previous output files" on the last line are the application's own log messages. Everything in between comes from Spark itself. The application logic is executed after this log is printed.
15/05/07 09:18:31 INFO Main: Initializing context
15/05/07 09:18:31 INFO SparkContext: Running Spark version 1.3.1
15/05/07 09:18:31 INFO SecurityManager: Changing view acls to: myuser,myapp
15/05/07 09:18:31 INFO SecurityManager: Changing modify acls to: myuser,myapp
15/05/07 09:18:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
15/05/07 09:18:31 INFO Slf4jLogger: Slf4jLogger started
15/05/07 09:18:31 INFO Remoting: Starting remoting
15/05/07 09:18:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@mymachine:54449]
15/05/07 09:18:31 INFO Utils: Successfully started service 'sparkDriver' on port 54449.
15/05/07 09:18:31 INFO SparkEnv: Registering MapOutputTracker
15/05/07 09:18:32 INFO SparkEnv: Registering BlockManagerMaster
15/05/07 09:18:32 INFO DiskBlockManager: Created local directory at C:\Users\myuser\AppData\Local\Temp\spark-2d3db9d6-ea78-438e-956f-be9c1dcf3a9d\blockmgr-e9ade223-a4b8-4d9f-b038-efd66adf9772
15/05/07 09:18:32 INFO MemoryStore: MemoryStore started with capacity 1956.7 MB
15/05/07 09:18:32 INFO HttpFileServer: HTTP File server directory is C:\Users\myuser\AppData\Local\Temp\spark-ff40d73b-e8ab-433e-88c4-35da27fb6278\httpd-def9220f-ac3a-4dd2-9ac1-2c593b94b2d9
15/05/07 09:18:32 INFO HttpServer: Starting HTTP Server
15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/07 09:18:32 INFO AbstractConnector: Started SocketConnector@0.0.0.0:54450
15/05/07 09:18:32 INFO Utils: Successfully started service 'HTTP file server' on port 54450.
15/05/07 09:18:32 INFO SparkEnv: Registering OutputCommitCoordinator
15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/07 09:18:32 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/05/07 09:18:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/05/07 09:18:32 INFO SparkUI: Started SparkUI at http://mymachine:4040
15/05/07 09:18:32 INFO SparkContext: Added JAR file:/D:/Projects/MyApp/MyApp.jar at http://10.111.111.199:54450/jars/MyApp.jar with timestamp 1430957912240
15/05/07 09:18:32 INFO RMProxy: Connecting to ResourceManager at cluster01/10.111.111.11:8050
15/05/07 09:18:32 INFO Client: Requesting a new application from cluster with 3 NodeManagers
15/05/07 09:18:32 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (23040 MB per container)
15/05/07 09:18:32 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/05/07 09:18:32 INFO Client: Setting up container launch context for our AM
15/05/07 09:18:32 INFO Client: Preparing resources for our AM container
15/05/07 09:18:32 INFO Client: Source and destination file systems are the same. Not copying hdfs://cluster01/apps/spark/spark-assembly-1.3.1-hadoop2.6.0.jar
15/05/07 09:18:32 INFO Client: Setting up the launch environment for our AM container
15/05/07 09:18:33 INFO SecurityManager: Changing view acls to: myuser,myapp
15/05/07 09:18:33 INFO SecurityManager: Changing modify acls to: myuser,myapp
15/05/07 09:18:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
15/05/07 09:18:33 INFO Client: Submitting application 2 to ResourceManager
15/05/07 09:18:33 INFO YarnClientImpl: Submitted application application_1430956687773_0002
15/05/07 09:18:34 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:34 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1430957906540
final status: UNDEFINED
tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
user: myapp
15/05/07 09:18:35 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:36 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:37 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:37 INFO YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM@cluster02:39698/user/YarnAM#-1579648782]
15/05/07 09:18:37 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> cluster01, PROXY_URI_BASES -> http://cluster01:8088/proxy/application_1430956687773_0002), /proxy/application_1430956687773_0002
15/05/07 09:18:37 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/05/07 09:18:38 INFO Client: Application report for application_1430956687773_0002 (state: RUNNING)
15/05/07 09:18:38 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: cluster02
ApplicationMaster RPC port: 0
queue: default
start time: 1430957906540
final status: UNDEFINED
tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
user: myapp
15/05/07 09:18:38 INFO YarnClientSchedulerBackend: Application application_1430956687773_0002 has started running.
15/05/07 09:18:38 INFO NettyBlockTransferService: Server created on 54491
15/05/07 09:18:38 INFO BlockManagerMaster: Trying to register BlockManager
15/05/07 09:18:38 INFO BlockManagerMasterActor: Registering block manager mymachine:54491 with 1956.7 MB RAM, BlockManagerId(<driver>, mymachine, 54491)
15/05/07 09:18:38 INFO BlockManagerMaster: Registered BlockManager
15/05/07 09:18:43 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster02:44996/user/Executor#-786778979] with ID 1
15/05/07 09:18:43 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
15/05/07 09:18:43 INFO MainProcessor: Deleting previous output files
Thanks.
UPDATE
I think I've found the (maybe partial, but major) cause.
It's between the following lines:
15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster04:55237/user/Executor#-149550753] with ID 1
I looked at the logs on the cluster side and found the following lines: (the exact times differ from the lines above because they come from a different machine)
15/05/08 11:36:23 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000
15/05/08 11:36:28 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454
It looks like Spark deliberately sleeps for 5 seconds.
I've read the Spark source code. In org.apache.spark.deploy.yarn.ApplicationMaster.scala, launchReporterThread() has the code for this.
It loops, calling allocator.allocateResources() and Thread.sleep().
For the sleep, it reads the configuration variable spark.yarn.scheduler.heartbeat.interval-ms (the default value is 5000, i.e. 5 seconds).
According to the comment, "we want to be reasonably responsive without causing too many requests to RM".
So, unless YARN fulfills the allocation request immediately, it seems that 5 seconds will be wasted.
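A simplified sketch of the relevant logic (paraphrased from memory, not the exact Spark 1.3.1 source; names and structure are approximate):

// Paraphrased from org.apache.spark.deploy.yarn.ApplicationMaster (Spark 1.3.x).
// Simplified for illustration, not the exact source.
private def launchReporterThread(): Thread = {
  // Default is 5000 ms, i.e. the 5-second wait observed above.
  val interval = sparkConf.getLong("spark.yarn.scheduler.heartbeat.interval-ms", 5000)
  val t = new Thread {
    override def run(): Unit = {
      while (!finished) {
        allocator.allocateResources()  // ask the RM for (more) containers
        Thread.sleep(interval)         // then sleep a full heartbeat interval
      }
    }
  }
  t.setDaemon(true)
  t.setName("Reporter")
  t.start()
  t
}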
When I changed the configuration variable to 1000, it only waited 1 second.
Here are the log lines after the change:
15/05/08 11:47:21 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 1000
15/05/08 11:47:22 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454
That saved 4 seconds.
So, when one doesn't want to wait the full 5 seconds, one can change spark.yarn.scheduler.heartbeat.interval-ms.
I hope the additional overhead it incurs is negligible.
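For example, it can be passed on the command line like this (the main class and jar name here are just placeholders; the backslash line continuations assume a Unix shell):

spark-submit --master yarn-client \
    --conf spark.yarn.scheduler.heartbeat.interval-ms=1000 \
    --class Main MyApp.jar

Setting it on the SparkConf before the context is created should also work, since Spark propagates its configuration to the ApplicationMaster.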
UPDATE
A related JIRA issue has been opened and resolved. See https://issues.apache.org/jira/browse/SPARK-7533
That's typical. My system takes about 20 seconds from running spark-submit to getting a SparkContext.
As it says in several places in the docs, the solution is to turn your driver into an RPC server. That way you initialize it once, and then other applications can use the driver's context as a service.
I am in the middle of doing this with my application. I am using http4s and turning my driver into a web server.
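A minimal sketch of the idea, using the JDK's built-in HTTP server rather than http4s just to keep the example self-contained (the port, the endpoint, and the job inside it are placeholders, not from my actual app):

import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.{SparkConf, SparkContext}

object DriverServer {
  def main(args: Array[String]): Unit = {
    // Pay the context-creation cost once, when the server starts.
    val sc = new SparkContext(new SparkConf().setAppName("long-lived-driver"))

    val server = HttpServer.create(new InetSocketAddress(8090), 0)
    server.createContext("/count", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        // Placeholder job: every request reuses the same SparkContext.
        val result = sc.parallelize(1 to 1000000).count().toString
        val body = result.getBytes("UTF-8")
        exchange.sendResponseHeaders(200, body.length.toLong)
        exchange.getResponseBody.write(body)
        exchange.close()
      }
    })
    server.start()
  }
}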
If you are using macOS to run some tasks in standalone mode,
just remember to enable the remote ssh connection in your System Preferences -> Sharing
(I don't know why we have to do that).
Before enabling it, it took me about 1 minute to run spark-submit xx.py;
after enabling it, it only took 3 seconds.
I hope it helps others who hit this kind of problem on the macOS platform.
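(As a side note, I believe the same Remote Login setting can also be toggled from the terminal with the built-in systemsetup tool, e.g. sudo systemsetup -setremotelogin on, but the Sharing preference pane is what I actually used.)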
You can check out Apache Livy, which is a REST API in front of Spark.
You can have one session and send multiple requests to that one Spark/Livy session.
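For example (a rough sketch, assuming a Livy server on its default port 8998; livy-host is a placeholder), you create the session once and then send statements to it:

curl -X POST -H "Content-Type: application/json" -d '{"kind": "spark"}' http://livy-host:8998/sessions
curl -X POST -H "Content-Type: application/json" -d '{"code": "sc.parallelize(1 to 100).count()"}' http://livy-host:8998/sessions/0/statements

Only the first call pays the context-creation cost; later statements reuse the running session.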
For faster Spark context creation.
Tested on EMR:

cd /usr/lib/spark/jars/; zip /tmp/yarn-archive.zip *.jar
cd path/to/folder/of/someOtherDependancy/jarFolder/; zip /tmp/yarn-archive.zip jar-file.jar
zip -Tv /tmp/yarn-archive.zip   # tests integrity, with verbose debug output

If yarn-archive.zip already exists on hdfs, then:
hdfs dfs -rm -r -f -skipTrash /user/hadoop/yarn-archive.zip
hdfs dfs -put /tmp/yarn-archive.zip /user/hadoop/
otherwise:
hdfs dfs -put /tmp/yarn-archive.zip /user/hadoop/

Then use this parameter with spark-submit:
--conf spark.yarn.archive="hdfs:///user/hadoop/yarn-archive.zip"
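For example, a full submit command might look like this (the class name and application jar are placeholders, not part of the steps above):

spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.archive="hdfs:///user/hadoop/yarn-archive.zip" \
    --class com.example.MyJob my-job.jar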
This works because the master does not have to distribute all the jars to the slaves; they can fetch them from a common hdfs path, here hdfs:///user/hadoop/yarn-archive.zip.
I found that it can save you 3-5 seconds, and the saving also depends on the number of nodes in the cluster. The more nodes, the more time you save.