在 EMR 上使用 Sparkling Water 进行的 GBM 训练因数据量增加而失败
GBM training with Sparkling Water on EMR failing with increased data size
我正在尝试使用 Sparkling Water 在具有 60 个 c4.8xlarge 节点的 EMR 集群上训练 GBM。该过程成功运行到特定的数据大小。一旦我达到一定的数据大小(训练示例的数量),该过程就会在 SpreadRDDBuilder.scala 的收集阶段冻结,并在一小时后死亡。当这种情况发生时,网络内存继续增长到容量,而 Spark 阶段没有进展(见下文)并且 CPU 使用率和网络流量很少。我试过增加执行程序和驱动程序的内存以及执行程序的数量,但我在所有配置下都看到了完全相同的行为。
感谢您查看此内容。这是我第一次在这里发帖,如果您需要更多信息,请告诉我。
参数
spark-submit --num-executors 355 --driver-class-path h2o-genmodel-3.10.1.2.jar:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/* --driver-memory 20G --executor-memory 10G --conf spark.sql.shuffle.partitions=10000 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --driver-java-options -Dlog4j.configuration=file:${PWD}/log4j.xml --conf spark.ext.h2o.repl.enabled=false --conf spark.dynamicAllocation.enabled=false --conf spark.locality.wait=3000 --class com.X.X.X.Main X.jar -i s3a://x
我尝试过但没有成功的其他参数:
conf spark.ext.h2o.topology.change.listener.enabled=false
conf spark.scheduler.minRegisteredResourcesRatio=1
conf spark.task.maxFailures=1
conf spark.yarn.max.executor.failures=1
火花UI
collect at SpreadRDDBuilder.scala:105 118/3551
collect at SpreadRDDBuilder.scala:105 109/3551
collect at SpreadRDDBuilder.scala:105 156/3551
collect at SpreadRDDBuilder.scala:105 151/3551
collect at SpreadRDDBuilder.scala:105 641/3551
驱动日志
17/02/13 22:43:39 WARN LiveListenerBus: Dropped 49459 SparkListenerEvents since Mon Feb 13 22:42:39 UTC 2017
[Stage 9:(641 + 1043) / 3551][Stage 10:(151 + 236) / 3551][Stage 11:(156 + 195) / 3551]
纱线容器的标准错误
t.ScheduledThreadPoolExecutor$ScheduledFutureTask.access1(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 14 more
17/02/13 22:56:34 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(222,[Lscala.Tuple2;@c7ac58,BlockManagerId(222, ip-172-31-25-18.ec2.internal, 36644))]
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:119)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
at org.apache.spark.executor.Executor$$anon$$anonfun$run.apply$mcV$sp(Executor.scala:547)
at org.apache.spark.executor.Executor$$anon$$anonfun$run.apply(Executor.scala:547)
at org.apache.spark.executor.Executor$$anon$$anonfun$run.apply(Executor.scala:547)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1953)
at org.apache.spark.executor.Executor$$anon.run(Executor.scala:547)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access1(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
... 13 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 14 more
17/02/13 22:56:41 WARN TransportResponseHandler: Ignoring response for RPC 8189382742475673817 from /172.31.27.164:37563 (81 bytes) since it is not outstanding
17/02/13 22:56:41 WARN TransportResponseHandler: Ignoring response for RPC 7998046565668775240 from /172.31.27.164:37563 (81 bytes) since it is not outstanding
17/02/13 22:56:41 WARN TransportResponseHandler: Ignoring response for RPC 8944638230411142855 from /172.31.27.164:37563 (81 bytes) since it is not outstanding
问题在于将非常高的基数(数亿个唯一值)字符串列转换为枚举。从数据框中删除这些列解决了这个问题。有关更多详细信息,请参阅:https://community.h2o.ai/questions/1747/gbm-training-with-sparkling-water-on-emr-failing-w.html
我正在尝试使用 Sparkling Water 在具有 60 个 c4.8xlarge 节点的 EMR 集群上训练 GBM。该过程成功运行到特定的数据大小。一旦我达到一定的数据大小(训练示例的数量),该过程就会在 SpreadRDDBuilder.scala 的收集阶段冻结,并在一小时后死亡。当这种情况发生时,网络内存继续增长到容量,而 Spark 阶段没有进展(见下文)并且 CPU 使用率和网络流量很少。我试过增加执行程序和驱动程序的内存以及执行程序的数量,但我在所有配置下都看到了完全相同的行为。
感谢您查看此内容。这是我第一次在这里发帖,如果您需要更多信息,请告诉我。
参数
spark-submit --num-executors 355 --driver-class-path h2o-genmodel-3.10.1.2.jar:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/* --driver-memory 20G --executor-memory 10G --conf spark.sql.shuffle.partitions=10000 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --driver-java-options -Dlog4j.configuration=file:${PWD}/log4j.xml --conf spark.ext.h2o.repl.enabled=false --conf spark.dynamicAllocation.enabled=false --conf spark.locality.wait=3000 --class com.X.X.X.Main X.jar -i s3a://x
我尝试过但没有成功的其他参数:
conf spark.ext.h2o.topology.change.listener.enabled=false
conf spark.scheduler.minRegisteredResourcesRatio=1
conf spark.task.maxFailures=1
conf spark.yarn.max.executor.failures=1
火花UI
collect at SpreadRDDBuilder.scala:105 118/3551
collect at SpreadRDDBuilder.scala:105 109/3551
collect at SpreadRDDBuilder.scala:105 156/3551
collect at SpreadRDDBuilder.scala:105 151/3551
collect at SpreadRDDBuilder.scala:105 641/3551
驱动日志
17/02/13 22:43:39 WARN LiveListenerBus: Dropped 49459 SparkListenerEvents since Mon Feb 13 22:42:39 UTC 2017
[Stage 9:(641 + 1043) / 3551][Stage 10:(151 + 236) / 3551][Stage 11:(156 + 195) / 3551]
纱线容器的标准错误
t.ScheduledThreadPoolExecutor$ScheduledFutureTask.access1(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 14 more
17/02/13 22:56:34 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(222,[Lscala.Tuple2;@c7ac58,BlockManagerId(222, ip-172-31-25-18.ec2.internal, 36644))]
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:119)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
at org.apache.spark.executor.Executor$$anon$$anonfun$run.apply$mcV$sp(Executor.scala:547)
at org.apache.spark.executor.Executor$$anon$$anonfun$run.apply(Executor.scala:547)
at org.apache.spark.executor.Executor$$anon$$anonfun$run.apply(Executor.scala:547)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1953)
at org.apache.spark.executor.Executor$$anon.run(Executor.scala:547)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access1(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
... 13 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 14 more
17/02/13 22:56:41 WARN TransportResponseHandler: Ignoring response for RPC 8189382742475673817 from /172.31.27.164:37563 (81 bytes) since it is not outstanding
17/02/13 22:56:41 WARN TransportResponseHandler: Ignoring response for RPC 7998046565668775240 from /172.31.27.164:37563 (81 bytes) since it is not outstanding
17/02/13 22:56:41 WARN TransportResponseHandler: Ignoring response for RPC 8944638230411142855 from /172.31.27.164:37563 (81 bytes) since it is not outstanding
问题在于将非常高的基数(数亿个唯一值)字符串列转换为枚举。从数据框中删除这些列解决了这个问题。有关更多详细信息,请参阅:https://community.h2o.ai/questions/1747/gbm-training-with-sparkling-water-on-emr-failing-w.html