Spark on YARN cluster creates a Spark job with far fewer workers than specified in the Spark context

Spark on YARN cluster creates a Spark job whose number of workers (only 4) is far smaller than the 100 specified in the Spark context. This is how I create the Spark context and session:

import pyspark
from pyspark.sql import SparkSession

config_list = [
    ('spark.yarn.dist.archives', 'xxxxxxxxxxx'),
    ('spark.yarn.appMasterEnv.PYSPARK_PYTHON', 'xxxxxxxxx'),
    ('spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON', 'xxxxxxxxxxx'),
    ('spark.local.dir', 'xxxxxxxxxxxxxxxxxx'),
    ('spark.submit.deployMode', 'client'),
    ('spark.yarn.queue', 'somequeue'),
    # executor sizing: 100 executors requested, 40g each plus 10g overhead
    ('spark.dynamicAllocation.minExecutors', '100'),
    ('spark.dynamicAllocation.maxExecutors', '100'),
    ('spark.executor.instances', '100'),
    ('spark.executor.memory', '40g'),
    ('spark.driver.memory', '40g'),
    ('spark.yarn.executor.memoryOverhead', '10g')
]

conf = pyspark.SparkConf().setAll(config_list)

# build the session against YARN with the configuration above
spark = SparkSession.builder.master('yarn')\
    .config(conf=conf)\
    .appName('myapp')\
    .getOrCreate()

sc = spark.sparkContext

Any ideas would be greatly appreciated.

If the minimum number of workers/executors you specify is greater than or equal to the number that actually exists in your cluster, the Spark session will allocate at most the number of workers that are available in your cluster when your job runs.
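
As a side note on the settings shown in the question: the spark.dynamicAllocation.minExecutors/maxExecutors values are only consulted when dynamic allocation is actually switched on, which it is not by default. A minimal sketch of how those settings are usually combined (the values here are illustrative, not a recommendation for your cluster):

# Illustrative only: dynamic allocation must be enabled explicitly, and on YARN
# it traditionally also needs the external shuffle service (or, on Spark 3.x,
# spark.dynamicAllocation.shuffleTracking.enabled). Otherwise only the static
# spark.executor.instances setting is honored.
dynamic_allocation_settings = [
    ('spark.dynamicAllocation.enabled', 'true'),
    ('spark.shuffle.service.enabled', 'true'),
    ('spark.dynamicAllocation.minExecutors', '10'),
    ('spark.dynamicAllocation.maxExecutors', '100'),
]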

You can also verify how many executors the session was configured with by running:

sc._conf.get('spark.executor.instances')
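
Note that spark.executor.instances only echoes back the value that was requested in the configuration. If you want the number of executors that actually registered with the driver, one common workaround is to go through the underlying JVM SparkContext; a rough sketch (this relies on the private _jsc handle, so treat it as best-effort):

# getExecutorMemoryStatus() has one entry per executor plus one for the driver,
# so subtracting 1 gives the number of executors currently registered.
num_executors = sc._jsc.sc().getExecutorMemoryStatus().size() - 1
print(num_executors)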

Hope this helps.