How to avoid Spark executor from getting lost and YARN container killing it due to memory limit?

Question

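I have the following code, which fires hiveContext.sql() most of the time. My task is to create a few tables and insert values into them after processing every partition of a Hive table.

So I first fire show partitions, and for each row of its output, in a for loop, I call a few methods that create a table (if it doesn't exist) and insert into it using hiveContext.sql.

Now, we can't use hiveContext inside an executor, so I have to run this in a for loop in the driver program, and the statements should run serially, one by one. When I submit this Spark job to a YARN cluster, almost every time my executors get lost because of a shuffle not found exception.

This is happening because YARN is killing my executors for exceeding their memory limit. I don't understand why, since the data set for each Hive partition is very small, yet YARN still kills my executors.

Will the following code do everything in parallel and try to hold the data of all Hive partitions in memory at the same time?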

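import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.hive.HiveContext;

public static void main(String[] args) throws IOException {
    SparkConf conf = new SparkConf();
    SparkContext sc = new SparkContext(conf);
    HiveContext hc = new HiveContext(sc);
    // File system handle for the existence check below (not declared in the posted snippet).
    FileSystem fs = FileSystem.get(sc.hadoopConfiguration());

    // List the partitions for one date; each row is a "/"-separated list of key=value pairs.
    DataFrame partitionFrame = hc.sql("show partitions dbdata partition(date=\"2015-08-05\")");

    // collect() brings only the partition names to the driver, not the partition data.
    Row[] rowArr = partitionFrame.collect();
    for (Row row : rowArr) {
        String[] splitArr = row.getString(0).split("/");
        String server = splitArr[0].split("=")[1];
        String date = splitArr[1].split("=")[1];
        String csvPath = "hdfs:///user/db/ext/" + server + ".csv";
        if (fs.exists(new Path(csvPath))) {
            hc.sql("ADD FILE " + csvPath);
        }
        // NOTE: `entity` is not defined anywhere in the posted snippet; it is left as posted.
        // Each call runs on the driver and blocks until its SQL statements finish.
        createInsertIntoTableABC(hc, entity, date);
        createInsertIntoTableDEF(hc, entity, date);
        createInsertIntoTableGHI(hc, entity, date);
        createInsertIntoTableJKL(hc, entity, date);
        createInsertIntoTableMNO(hc, entity, date);
    }
}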

Answer 1

Generally, you should always dig into the logs to find the real exception (at least in Spark 1.3.1).

tl;dr
Safe config for Spark under YARN:
spark.shuffle.memoryFraction=0.5 - lets shuffle use more of the allocated memory
spark.yarn.executor.memoryOverhead=1024 - set in MB. YARN kills an executor when its memory usage is larger than (executor-memory + executor.memoryOverhead)
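
To make the (executor-memory + executor.memoryOverhead) rule concrete, here is a minimal sketch of the arithmetic in plain Java. The class and method names are made up for illustration, the 3 GB executor size is an assumed value, and the model is simplified: it ignores any rounding the YARN scheduler may apply to container sizes.

```java
public class ContainerLimit {
    // Ceiling YARN enforces per container under the rule above, in MB:
    // executor heap (--executor-memory) plus the configured overhead.
    public static long containerLimitMb(long executorMemoryMb, long memoryOverheadMb) {
        return executorMemoryMb + memoryOverheadMb;
    }

    // True when the executor's memory use exceeds that ceiling,
    // which is the condition under which YARN kills the container.
    public static boolean wouldBeKilled(long usedMb, long executorMemoryMb, long memoryOverheadMb) {
        return usedMb > containerLimitMb(executorMemoryMb, memoryOverheadMb);
    }

    public static void main(String[] args) {
        long executorMemoryMb = 3 * 1024; // assumption: --executor-memory 3g
        long overheadMb = 1024;           // spark.yarn.executor.memoryOverhead=1024

        System.out.println("limit = " + containerLimitMb(executorMemoryMb, overheadMb) + " MB"); // limit = 4096 MB
        System.out.println(wouldBeKilled(3500, executorMemoryMb, overheadMb)); // false
        System.out.println(wouldBeKilled(4200, executorMemoryMb, overheadMb)); // true
    }
}
```

The point of the sketch is that the heap size alone does not decide whether the container survives: memory used outside the heap counts against the same ceiling, which is why raising the overhead can help even when the heap looks comfortably sized.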

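A little more info

From reading your question, you mention that you get a shuffle not found exception.

In the case of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle, you should increase spark.shuffle.memoryFraction, for example to 0.5.

The most common reason for YARN killing off my executors was memory usage beyond what it expected. To avoid that, increase spark.yarn.executor.memoryOverhead. I've set it to 1024, even though my executors use only 2-3G of memory.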

Answer 2

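This is my assumption: you must have a limited number of executors on your cluster, and the job might be running in a shared environment.

As you said, your files are small, so you can set a smaller number of executors and increase the cores per executor. Setting the memoryOverhead property is important here.

Set number of executors = 5
Set number of executor cores = 4
Set memory overhead = 2G
Shuffle partitions = 20 (to use the maximum parallelism based on executors and cores)

With the above properties I believe you will avoid executor out-of-memory issues without compromising performance.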
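
As a rough illustration of how these four settings could be passed to spark-submit, the sketch below assembles the option list in plain Java. Two things are assumptions rather than part of the answer: that "shuffle partition" refers to spark.sql.shuffle.partitions (plausible here because the job runs SQL through HiveContext), and that the 2G overhead is expressed as 2048 MB for spark.yarn.executor.memoryOverhead. The class name is hypothetical, and the application jar and main class are deliberately left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SubmitArgs {
    // Builds the spark-submit options for the sizing suggested above.
    // Shuffle partitions are derived as executors * cores, i.e. one task slot each.
    public static List<String> build(int numExecutors, int executorCores, int overheadMb) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.yarn.executor.memoryOverhead", String.valueOf(overheadMb));
        // Assumption: "shuffle partition" means the Spark SQL shuffle partition count.
        conf.put("spark.sql.shuffle.partitions", String.valueOf(numExecutors * executorCores));

        List<String> args = new ArrayList<>();
        args.add("spark-submit");
        args.add("--num-executors");
        args.add(String.valueOf(numExecutors));
        args.add("--executor-cores");
        args.add(String.valueOf(executorCores));
        for (Map.Entry<String, String> e : conf.entrySet()) {
            args.add("--conf");
            args.add(e.getKey() + "=" + e.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        // 5 executors, 4 cores each, 2 GB overhead expressed in MB.
        System.out.println(String.join(" ", build(5, 4, 2048)));
    }
}
```

Deriving the partition count from executors times cores keeps the two numbers in step if the executor sizing changes later; whether 20 partitions is the right value for a given data volume is a separate question.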

memory

executors

hadoop-yarn

apache-spark

apache-spark-sql