Why does a map-only job in Hive produce a single output file?

Question

When I run the following query, I get only one output file, even though the job uses 8 mappers and 0 reducers.

create table table_2 as select * from table_1;

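Eight mappers were launched and there was no reduce phase, yet the location of table_2 contains a single file. With 8 mappers and 0 reducers, shouldn't there be 8 files?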

Answer 1

From the Hive documentation, Configuration Properties...

hive.merge.mapfiles
  Default Value: true
  Merge small files at the end of a map-only job.

hive.merge.tezfiles
  Default Value: false
  Merge small files at the end of a Tez DAG

hive.merge.smallfiles.avgsize
  Default Value: 16000000
  When the average output file size of a job is less than this number,
  Hive will start an additional map-reduce job to merge the output files into bigger files...

So if (a) your test dataset is very small and (b) you are not using Tez but the legacy MapReduce engine, Hive runs a post-Map step that, by default, simply merges the (intermediate) results.
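
One way to check that this merge step is what collapses the output is to turn it off for the session and rerun the statement. The following is a minimal sketch, assuming the MapReduce execution engine and the table from the question; table_2_unmerged is a hypothetical name chosen only to avoid clashing with the existing table_2.

```sql
-- Assumption: the session runs on the MapReduce engine, not Tez.
-- Disable the merge step that runs at the end of a map-only job.
SET hive.merge.mapfiles=false;

-- Hypothetical target table name; table_1 is the source table from the question.
CREATE TABLE table_2_unmerged AS SELECT * FROM table_1;
```

With merging disabled, each mapper should write its own file, so a job with 8 mappers would be expected to leave 8 files in the table location.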

This merge does not happen after a Reduce step, however, unless you force hive.merge.mapredfiles to true.
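
For a job that does include a reduce phase, the same merging can be requested explicitly. This is a sketch under stated assumptions: table_3 and the use of DISTINCT to force a reduce stage are hypothetical and not part of the original question.

```sql
-- Ask Hive to merge small files at the end of a map-reduce job as well.
SET hive.merge.mapredfiles=true;

-- The merge only triggers when the average output file size is below this threshold
-- (16000000 is the documented default quoted above).
SET hive.merge.smallfiles.avgsize=16000000;

-- Hypothetical example: DISTINCT forces a reduce stage, so the output comes from reducers.
CREATE TABLE table_3 AS SELECT DISTINCT * FROM table_1;
```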
