Does Hadoop Distcp copy at block level?

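DistCp between or within clusters runs as a MapReduce job. My assumption was that it copies files at the input-split level, which would help copy performance because a single file would be copied by multiple mappers working on multiple "pieces" in parallel. However, when I went through the Hadoop DistCp documentation, it seems that DistCp only works at the file level. Please refer to: hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html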

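According to the DistCp documentation, DistCp only splits the list of files, not the files themselves, and hands partitions of that list to the mappers.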
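To make "splitting the list, not the files" concrete, here is a minimal Python sketch. It is not DistCp's actual code: the function name `partition_listing`, the paths, and the rule of balancing partitions by bytes are all assumptions made for illustration. The only property taken from the documentation is that a file is never divided between mappers.

```python
def partition_listing(files, num_maps):
    """Split a file listing into at most num_maps partitions.

    files: list of (path, size_in_bytes) tuples (hypothetical listing).
    Each file lands in exactly one partition; files are never split.
    """
    if num_maps < 1:
        raise ValueError("num_maps must be at least 1")
    total_bytes = sum(size for _, size in files)
    target = total_bytes / num_maps  # rough byte budget per mapper
    partitions = [[]]
    current_bytes = 0
    for path, size in files:
        # Open a new partition once the current one has met its budget,
        # as long as we have not used up the requested number of maps.
        if partitions[-1] and current_bytes >= target and len(partitions) < num_maps:
            partitions.append([])
            current_bytes = 0
        partitions[-1].append((path, size))
        current_bytes += size
    return partitions


# Hypothetical listing: sizes are in arbitrary units.
listing = [("/data/a", 50), ("/data/b", 10), ("/data/c", 10), ("/data/d", 30)]
print(partition_listing(listing, 2))

# A single large file yields one partition, however many maps are requested.
print(partition_listing([("/data/big", 50 * 1024**3)], 8))
```

In this sketch the single large file ends up in one partition, so only one mapper has work to do, which matches the behaviour the documentation describes.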

Can anyone tell me how exactly this works?

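Additional question: if a file is assigned to only one mapper, how does that mapper find all of the file's input splits on the single node it is running on?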

For a single file of ~50G, exactly one map task will be triggered to copy the data, because a file is the finest level of granularity in DistCp.

Quoting from the documentation:

Why does DistCp not run faster when more maps are specified?

At present, the smallest unit of work for DistCp is a file. i.e., a file is processed by only one map. Increasing the number of maps to a value exceeding the number of files would yield no performance benefit. The number of maps launched would equal the number of files.
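A small arithmetic sketch of what that quote implies, in Python. The function names and the per-map throughput figure are hypothetical, and the model assumes each map copies at a fixed rate. Because one file is handled by one map, the largest file sets a lower bound on the job's copy time.

```python
def effective_maps(requested_maps, num_files):
    """Maps that can actually do work when a file is the smallest unit."""
    return min(requested_maps, num_files)


def min_copy_seconds(file_sizes_bytes, per_map_bytes_per_sec):
    """Lower bound on copy time: the largest file is copied by a single map."""
    return max(file_sizes_bytes) / per_map_bytes_per_sec


GIB = 1024**3
MIB = 1024**2

# One ~50G file, 20 maps requested, assumed 100 MiB/s per map.
print(effective_maps(20, 1))                 # only one map has work
print(min_copy_seconds([50 * GIB], 100 * MIB))  # seconds, under the assumed rate
```

Under these assumptions, asking for more maps does not shorten the copy of a single file. It only helps when there are more files to spread across them.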

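Update
The block locations of a file are obtained from the NameNode during the MapReduce job. In DistCp, each mapper is launched, if possible, on the node where the first block of its file resides. When a file consists of multiple splits, any that are not available on that same node are fetched from nearby nodes.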
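The local-versus-nearby behaviour can be pictured with a toy Python model. This is only a sketch: the host names, the replica lists, and the function `plan_block_reads` are hypothetical, and taking the first listed replica stands in for "nearest" without modelling real network topology.

```python
def plan_block_reads(block_replicas, mapper_host):
    """Decide, per block, where a single mapper would read it from.

    block_replicas: one list of replica hosts per block, in file order
                    (a stand-in for the locations the NameNode reports).
    Returns a list of (block_index, source_host, "local" or "remote").
    """
    plan = []
    for index, hosts in enumerate(block_replicas):
        if mapper_host in hosts:
            # A replica lives on the mapper's own node: read it locally.
            plan.append((index, mapper_host, "local"))
        else:
            # No local replica: fall back to another node holding the block.
            plan.append((index, hosts[0], "remote"))
    return plan


# Hypothetical three-block file with three replicas per block.
blocks = [
    ["node1", "node2", "node3"],
    ["node2", "node4", "node5"],
    ["node1", "node5", "node6"],
]
# The mapper runs on node1, where the first block resides.
for entry in plan_block_reads(blocks, "node1"):
    print(entry)
```

The point of the sketch is that the mapper does not need every block on its own node. It reads the whole file sequentially, and blocks without a local replica simply come over the network.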


hadoop

cluster-computing

hdfs

distcp