Best option for merging multiple files within the same partition in Hadoop?

Question

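I have a table partitioned on event_date, and for some reason, when I inserted data into my external table, some dates ended up with only one or two files while others have over 200.

I always use this snippet when kicking off the Hive queries that insert the data, so I'm not sure where or how it went wrong for some dates but not others. I thought the 'merge.tezfiles' line was specifically what handled merging the files on insert.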

SET mapred.job.queue.name=my_directory;
use this_directory;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.dynamic.partitions.pernode=2000;
SET hive.merge.tezfiles=true;

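Everything I've found online mentions having to copy the files locally and upload them again.

Is there a way to merge the multiple files within each date partition in a clean and simple manner?

I tried the following on a few dates that had 4 and 15 files, respectively. The Hive output after it ran confirmed that the extraneous files were removed, but when I went back and looked in Hadoop, there were just as many files as when I began. Fortunately, the data was still accurate when I checked it, so I'm not sure what it actually deleted. Is this simply not the right command?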

alter table table_being_edited PARTITION(event_dt='2017-01-01') CONCATENATE;

This is the line that confirmed the extra files had been removed:

Moved: 'my_hdfs_filepath/event_dt=2019-10-24/000052_0' to trash at: my_trash_directory/.Trash/Current

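OK
Time taken: 75.321 seconds

For the date that had 15 files, it gave me similar output 15 times.

I'm hoping to shrink the dates with many files down to just one or two if possible, since we're running out of namespace. I'm very new to all of this, so is there a barebones, simple way to merge files within a single date partition?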

Answer 1

You can try setting the following properties:


SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=134217728; -- 128 MB

You can refer to this link.

Answer 2

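By adding this line in addition to my other SET Hive parameters, I was able to consistently merge the part files into a single file of 5 GB or smaller when inserting them into a new table: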

set hive.merge.smallfiles.avgsize=5000000000;
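
A minimal sketch of that approach might look like the following. The target table name table_being_edited_merged is hypothetical, the source table name and date are borrowed from the question, and the sketch assumes event_dt is the only partition column (so it is the last column returned by SELECT *) and that the query runs on Tez:

```sql
-- Hypothetical target table with the same columns and partitioning as the source.
CREATE TABLE table_being_edited_merged LIKE table_being_edited;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.merge.tezfiles=true;
SET hive.merge.smallfiles.avgsize=5000000000;

-- Copy one date into the new table; the dynamic partition value is taken
-- from the last column of the SELECT, assumed here to be event_dt.
INSERT OVERWRITE TABLE table_being_edited_merged PARTITION (event_dt)
SELECT * FROM table_being_edited
WHERE event_dt = '2017-01-01';
```

Comparing row counts for that date in both tables before relying on the new one would be a sensible check.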

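It is also possible to use getmerge and then put the file back, but that requires an extra step of pulling the files down to local storage (you must have plenty of storage, depending on the size of your files), which is more cumbersome than creating a new table and inserting into it with this additional SET parameter.

Another option is to use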

set hive.merge.mapfiles=true;

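This parameter appears to affect the number of mappers created. If we have a number of small files, it has to create that many mappers, which is not optimal for the Hadoop design, so the Tez merge option is more suitable.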

Answer 3

If the block size of HDFS/MapR-FS is 256 MB, it is better to set smallfiles.avgsize to 256 MB.

SET hive.merge.tezfiles=true; --Merge small files at the end of a Tez DAG.
SET hive.merge.mapfiles=true; --Hive will start an additional map-reduce job to merge the output files into bigger files
SET hive.merge.mapredfiles=true; --Hive will start an additional map-reduce job to merge the output files into bigger files
SET hive.merge.orcfile.stripe.level=true; --When hive.merge.mapfiles, hive.merge.mapredfiles or hive.merge.tezfiles is enabled while writing a table with ORC file format, enabling this configuration property will do stripe-level fast merge for small ORC files.
SET hive.merge.size.per.task=256000000; --Size of merged files at the end of the job.
SET hive.merge.smallfiles.avgsize=256000000; --When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true.
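
As a rough sketch, these settings could be combined with an INSERT OVERWRITE that rewrites a single partition from itself, so the merge step runs at the end of the job. The table name and date come from the question; col_a and col_b are hypothetical placeholders for the table's non-partition columns. This assumes the job runs on Tez and that overwriting the partition in place is acceptable in your environment, so trying it on a copy first would be prudent:

```sql
SET hive.merge.tezfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=256000000;

-- Static partition spec: select only the non-partition columns.
-- col_a and col_b are placeholders for the real column list.
INSERT OVERWRITE TABLE table_being_edited PARTITION (event_dt='2017-01-01')
SELECT col_a, col_b
FROM table_being_edited
WHERE event_dt = '2017-01-01';
```

Listing the partition directory afterwards should show whether the file count actually dropped.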
