Hive count(1) leads to OOM

I have a new cluster built with CDH 6.3, with Hive already set up; there are 3 nodes with 30 GB of memory.

I created a target Hive table stored as Parquet, and put some Parquet files downloaded from another cluster into this table's HDFS directory. When I run

select count(1) from tableA

it eventually shows:

INFO  : 2021-09-05 14:09:06,505 Stage-1 map = 62%,  reduce = 0%, Cumulative CPU 436.69 sec
INFO  : 2021-09-05 14:09:07,520 Stage-1 map = 74%,  reduce = 0%, Cumulative CPU 426.94 sec
INFO  : 2021-09-05 14:09:10,562 Stage-1 map = 94%,  reduce = 0%, Cumulative CPU 464.3 sec
INFO  : 2021-09-05 14:09:26,785 Stage-1 map = 94%,  reduce = 31%, Cumulative CPU 464.73 sec
INFO  : 2021-09-05 14:09:50,112 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 464.3 sec
INFO  : MapReduce Total cumulative CPU time: 7 minutes 44 seconds 300 msec
ERROR : Ended Job = job_1630821050931_0003 with errors
ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 18  Reduce: 1   Cumulative CPU: 464.3 sec   HDFS Read: 4352500295 HDFS Write: 0 HDFS EC Read: 0 FAIL
INFO  : Total MapReduce CPU Time Spent: 7 minutes 44 seconds 300 msec
INFO  : Completed executing command(queryId=hive_20210905140833_6a46fec2-91fb-4214-a734-5b76e59a4266); Time taken: 77.981 seconds

Looking at the MapReduce logs, the following error appears repeatedly:

Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
    at org.apache.parquet.bytes.HeapByteBufferAllocator.allocate(HeapByteBufferAllocator.java:32)
    at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1080)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:712)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:213)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:101)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:63)
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
    ... 16 more

The Parquet files are only 4.5 GB in total, so why does count(1) run out of memory? Which MapReduce parameters should I change?

There are two ways to fix OOM in the mappers: (1) increase mapper parallelism, or (2) increase mapper memory.

Try increasing parallelism first.

Check the current values of these parameters and decrease mapreduce.input.fileinputformat.split.maxsize to get smaller mappers:

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapreduce.input.fileinputformat.split.minsize=16000;     -- 16 KB: files smaller than the min size are combined and processed by the same mapper
set mapreduce.input.fileinputformat.split.maxsize=128000000; -- 128 MB: files bigger than the max size are split. Decrease this setting to get roughly 2x more (smaller) mappers
-- These figures are examples only. Compare with your current values and decrease accordingly until you get about 2x more mappers
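
For reference, running set with just the property name prints its current value in the Hive/Beeline session, so you can inspect what you have before lowering it (what gets printed depends on which properties are visible in your client configuration):

-- Print the current split sizes and input format before changing them
set mapreduce.input.fileinputformat.split.minsize;
set mapreduce.input.fileinputformat.split.maxsize;
set hive.input.format;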

Or try increasing the mapper size:

set mapreduce.map.memory.mb=4096;      -- compare with the current setting and increase
set mapreduce.map.java.opts=-Xmx3000m; -- set ~30% less than mapreduce.map.memory.mb
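
One thing to watch (an assumption about your cluster, not something from the question): the mapper container size cannot exceed the YARN per-container maximum, so it may be worth confirming that ceiling before raising the values above. A minimal sketch:

-- Sketch: check the YARN ceiling first; this only prints a value if the property
-- is visible in the client configuration, otherwise check yarn-site.xml / Cloudera Manager
set yarn.scheduler.maximum-allocation-mb;
-- Then raise the mapper container and heap together, keeping the heap ~30% below the container
set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3000m;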

Also try disabling map-side aggregation (map-side aggregation often causes OOM on the mappers):

set hive.map.aggr=false;
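
Putting it together, one possible session for re-running the failing query could look like the sketch below; the numbers are illustrative only and should be tuned against your own cluster:

-- Illustrative combination of the settings above; tune the figures for your nodes
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapreduce.input.fileinputformat.split.maxsize=64000000;  -- smaller splits -> more, lighter mappers
set mapreduce.map.memory.mb=4096;                            -- bigger mapper containers
set mapreduce.map.java.opts=-Xmx3000m;                       -- heap ~30% below the container size
set hive.map.aggr=false;                                     -- no map-side aggregation hash table
select count(1) from tableA;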