How to find spark master URL on Amazon EMR

I am new to Spark and am trying to set it up on an Amazon EMR cluster running Spark 1.3.1. When I do

SparkConf sparkConfig = new SparkConf().setAppName("SparkSQLTest").setMaster("local[2]");

it works for me, but I have come to know that setting local[2] is only meant for testing purposes.

When I tried to use cluster mode, I changed it to

SparkConf sparkConfig = new SparkConf().setAppName("SparkSQLTest").setMaster("spark://localhost:7077");

With this I get the following error:

Tried to associate with unreachable remote address [akka.tcp://sparkMaster@localhost:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused 15/06/10 15:22:21 INFO client.AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@localhost:7077/user/Master...

Can anyone tell me how to set the master URL?

If you are using the bootstrap action from https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark, the configuration is set up for Spark on YARN. So just set the master to yarn-client or yarn-cluster. Be sure to define the number of executors along with their memory and cores. More details about Spark on YARN are at https://spark.apache.org/docs/latest/running-on-yarn.html
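Assuming that bootstrap action is in use, the original snippet would change along these lines. This is a sketch, not a definitive setup: the class name and the executor counts/sizes below are illustrative values, not anything prescribed by EMR.

```java
import org.apache.spark.SparkConf;

// Hedged sketch, assuming Spark on YARN via the EMR bootstrap action:
// the master becomes "yarn-client" (or "yarn-cluster") rather than a
// spark://host:7077 URL. Executor counts and sizes are example values.
public class SparkSQLTestConf {
    public static SparkConf build() {
        return new SparkConf()
                .setAppName("SparkSQLTest")
                .setMaster("yarn-client")
                .set("spark.executor.instances", "4") // number of executors (example)
                .set("spark.executor.memory", "2g")   // memory per executor (example)
                .set("spark.executor.cores", "2");    // cores per executor (example)
    }
}
```

Equivalently, these settings can be passed to spark-submit as --num-executors, --executor-memory, and --executor-cores instead of being hard-coded in the application.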

A supplement on executor settings for memory and core sizing:

See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html, specifically yarn.scheduler.maximum-allocation-mb, for the default YARN NodeManager configuration of each instance type. You can determine the number of cores from the basic EC2 info page (http://aws.amazon.com/ec2/instance-types/). The maximum executor memory has to fit within the max allocation less Spark's overhead, in increments of 256 MB. A good description of this calculation is at http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/. Don't forget that a little more than half of the executor memory is available for RDD caching.
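The sizing rule above (fit within the max allocation less Spark's overhead, round to 256 MB increments, a bit over half left for RDD cache) can be sketched in plain Java. The max(384 MB, 7%) overhead and the 0.54 cache fraction are assumed Spark 1.x defaults, and the 11520 MB figure is only an illustrative yarn.scheduler.maximum-allocation-mb value; check the actual numbers for your instance type and Spark version.

```java
public class ExecutorMemorySizing {
    static final int INCREMENT_MB = 256;

    // Largest executor memory (MB) such that memory + overhead fits
    // within YARN's maximum container allocation, rounded down to a
    // 256 MB increment. Overhead assumed as max(384 MB, 7% of memory),
    // the Spark 1.x YARN default.
    static int maxExecutorMemoryMb(int yarnMaxAllocationMb) {
        int candidate = yarnMaxAllocationMb;
        while (candidate > 0) {
            int overhead = Math.max(384, (int) Math.ceil(candidate * 0.07));
            if (candidate + overhead <= yarnMaxAllocationMb) {
                break;
            }
            candidate -= INCREMENT_MB;
        }
        return (candidate / INCREMENT_MB) * INCREMENT_MB;
    }

    // "A little more than half" for RDD caching: assumed Spark 1.x default
    // spark.storage.memoryFraction (0.6) times its safety fraction (0.9).
    static int rddCacheMemoryMb(int executorMemoryMb) {
        return (int) (executorMemoryMb * 0.54);
    }

    public static void main(String[] args) {
        int mem = maxExecutorMemoryMb(11520); // illustrative max allocation
        System.out.println(mem + " MB executor, ~" + rddCacheMemoryMb(mem)
                + " MB usable for RDD cache");
    }
}
```

For an 11520 MB max allocation this lands on a 10752 MB executor, since 10752 + max(384, ceil(0.07 × 10752)) = 11505 MB still fits, and 10752 is already a multiple of 256.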