Spark application stuck in RUNNING state, "Initial job has not accepted any resources"
I'm working on a distributed deep learning project using Apache Hadoop, Spark, and DL4J.
My main problem: when I launch my application on Spark, it goes into the RUNNING state and progress never gets above 10%. I get this warning:
2019-08-23 20:55:49,198 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1161
2019-08-23 20:55:49,224 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[5] at saveAsTextFile at BaseTrainingMaster.java:211) (first 15 tasks are for partitions Vector(0, 1))
2019-08-23 20:55:49,226 INFO cluster.YarnClusterScheduler: Adding task set 0.0 with 2 tasks
2019-08-23 20:56:04,286 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-08-23 20:56:17,526 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-08-23 20:56:23,135 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
These last three lines repeat indefinitely.
I have just 1 master node and 1 slave node, with Hadoop and Spark installed on both:
- Master: 8 GB RAM, Intel i5 6500
- Slave: 4 GB RAM, Intel i3 4400
After checking the WebUIs and log files for HDFS, I can see that HDFS is working fine, and the YARN WebUI and logs also show YARN working fine with 1 DataNode.
Here is my code, so you can see where it gets stuck:
VoidConfiguration config = VoidConfiguration.builder()
        .unicastPort(40123)
        .networkMask("192.168.0.0/42")
        .controllerAddress("192.168.1.35")
        .build();
log.log(Level.INFO, "==========After voidconf");

// Create the TrainingMaster instance
TrainingMaster trainingMaster = new SharedTrainingMaster.Builder(config, 1)
        .batchSizePerWorker(10)
        .workersPerNode(1)
        .build();
log.log(Level.INFO, "==========after training master");

SparkDl4jMultiLayer sparkNet = new SparkDl4jMultiLayer(sc, conf, trainingMaster);
log.log(Level.INFO, "==========after sparkMultilayer");

// Execute training:
log.log(Level.INFO, "==========Starting training");
for (int i = 0; i < 100; i++) {
    log.log(Level.INFO, "Epoch : " + i); // this is the last line from my code that appears in the log
    sparkNet.fit(rddDataSetClassification); // it gets stuck here
    log.log(Level.INFO, "Epoch : " + i + " / " + i);
}
log.log(Level.INFO, "after training");

// Dataset evaluation
Evaluation eval = sparkNet.evaluate(rddDataSetClassification);
log.log(Level.INFO, eval.stats());
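One side note on the snippet above: `.networkMask("192.168.0.0/42")` is not a valid IPv4 CIDR network, since an IPv4 prefix length can be at most 32 (something like `192.168.0.0/16` would be typical for this LAN). A quick way to sanity-check a mask, using only the Python standard library:

```python
import ipaddress

def valid_ipv4_cidr(mask: str) -> bool:
    """Return True if `mask` is a valid IPv4 network in CIDR notation."""
    try:
        ipaddress.IPv4Network(mask)
        return True
    except ValueError:
        return False

print(valid_ipv4_cidr("192.168.0.0/42"))  # False: /42 exceeds the 32-bit IPv4 limit
print(valid_ipv4_cidr("192.168.0.0/16"))  # True
```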
yarn-site.xml:
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>192.168.1.35</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3072</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>3072</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>256</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
spark-defaults.conf:
spark.master yarn
spark.driver.memory 2500m
spark.yarn.am.memory 2500m
spark.executor.memory 2000m
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hadoop-MS-7A75:9000/spark-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://hadoop-MS-7A75:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
I suspected a resource problem, so I tried setting properties such as spark.executor.cores and spark.executor.instances to 1.
I also tried tuning the memory allocations up and down on both the YARN side and the Spark side (I'm not sure exactly how that works).
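For what it's worth, the resource math in the configs above is very tight. YARN sizes each container as the requested heap plus an overhead (by default the larger of 384 MB or 10% of the heap, per Spark's memoryOverhead default), rounded up to a multiple of yarn.scheduler.minimum-allocation-mb. A rough sketch of that arithmetic, assuming those defaults and the values from my configs:

```python
import math

# Rough sketch of YARN container sizing, assuming Spark's default
# memoryOverhead rule: max(384 MB, 10% of the heap request).
def container_request_mb(heap_mb, min_alloc_mb=256):
    overhead_mb = max(384, int(heap_mb * 0.10))
    raw_mb = heap_mb + overhead_mb
    # YARN rounds each request up to a multiple of the minimum allocation
    return math.ceil(raw_mb / min_alloc_mb) * min_alloc_mb

node_capacity_mb = 3072                   # yarn.nodemanager.resource.memory-mb
am_mb = container_request_mb(2500)        # driver/AM memory from spark-defaults.conf
executor_mb = container_request_mb(2000)  # spark.executor.memory

print(am_mb)        # 3072 -- the AM container alone fills the whole node
print(executor_mb)  # 2560 -- so no executor container can also fit on one slave
```

If this sketch is right, the application master's container can consume the single 3 GB NodeManager entirely, leaving no room for any executor, which would produce exactly the repeating "Initial job has not accepted any resources" warning.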
Log from spark.deploy.master..out:
2019-08-23 20:18:33,669 INFO master.Master: I have been elected leader! New state: ALIVE
2019-08-23 20:18:40,771 INFO master.Master: Registering worker 192.168.1.37:42869 with 4 cores, 2.8 GB RAM
Log from spark.deploy.worker..out:
19/08/23 20:18:40 INFO Worker: Connecting to master hadoop-MS-7A75:7077...
19/08/23 20:18:40 INFO TransportClientFactory: Successfully created connection to hadoop-MS-7A75/192.168.1.35:7077 after 115 ms (0 ms spent in bootstraps)
19/08/23 20:18:40 INFO Worker: Successfully registered with master spark://hadoop-MS-7A75:7077
Solved it by adding another slave.
I don't know why or how it works, but when I added another slave node, it worked.