Hadoop: when the file is less than 64M, does increasing the number of nodes have an effect on the processing speed?

I know the default block size is 64M and the split is 64M. Then, for files less than 64M, when the number of nodes increases from 1 to 6, there will be only one node working on the single split, so the speed will not improve. Is that right? If it is a 128M file, its 2 splits will run on 2 nodes, which is faster than 1 node, and with 3 or more nodes there is no further speed improvement. Is that correct?

I am not sure whether my understanding is correct. Thanks for any guidance!

You are assuming that a large file is splittable to begin with, which is not always the case.

If your file is smaller than the block size, adding more nodes will never improve processing time; it only helps with replication and total cluster capacity.

Otherwise, your understanding seems correct, although I believe the latest default is actually 128 MB, not 64.
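
To make the arithmetic concrete, here is a minimal sketch in plain Java (estimateMapTasks is a hypothetical helper, not a Hadoop API) of why a file smaller than the block size yields a single map task while a 128 MB file yields two, assuming the split size equals the block size:

public class SplitMath {

    // Mirrors the rule FileInputFormat uses to compute the split size:
    // max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Hypothetical helper: number of splits for one file ~= ceil(fileSize / splitSize).
    static long estimateMapTasks(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB default in Hadoop 1.x
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);

        System.out.println(estimateMapTasks(32L * 1024 * 1024, splitSize));  // 32 MB file  -> 1 map task
        System.out.println(estimateMapTasks(128L * 1024 * 1024, splitSize)); // 128 MB file -> 2 map tasks
    }
}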

Here is the answer to your query:

I know the default block size is 64M,

The default block size was 64 MB in Hadoop version 1.0 and is 128 MB in version 2.0. The default can be overridden by setting the value of the parameter dfs.block.size in the configuration file hdfs-site.xml.
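
If you prefer to override it per client or per job rather than cluster-wide, here is a minimal sketch using the Hadoop Configuration API (dfs.blocksize is the newer name for the deprecated dfs.block.size; the 128 MB value is just an example):

import org.apache.hadoop.conf.Configuration;

public class BlockSizeExample {
    public static void main(String[] args) {
        // Files written with this configuration will use a 128 MB block size.
        // Cluster-wide, the same value would go into hdfs-site.xml instead.
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
    }
}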

split is 64M,

Not necessarily, because the block size is not the same as the split size. To be clearer: for a normal wordcount example program, we can safely assume that the split size is approximately the same as the block size.
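
If you do want the split size to differ from the block size, it can be bounded per job; here is a minimal sketch using the newer mapreduce API (the bounds shown are only examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");

        // The effective split size is max(minSize, min(maxSize, blockSize)),
        // so these bounds let it deviate from the block size.
        FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);  // 32 MB lower bound
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // 256 MB upper bound
    }
}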

then for files less than 64M, when the number of nodes increases from 1 to 6, there will be only one node working on the split, so the speed will not improve? Is that right?

Yes, you are right. If the file size is actually smaller than the block size, it will be processed by a single node, and increasing the number of nodes from 1 to 6 may not affect the execution speed. However, you have to consider the case of speculative execution. With speculative execution, even a small file may be processed by 2 nodes simultaneously, which can improve the execution speed.

From the Yahoo Dev KB link, speculative execution is explained as follows:

Speculative execution:

One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.

By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.

Speculative execution is enabled by default. You can disable speculative execution for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively, when using the old API; with the newer API you should set mapreduce.map.speculative and mapreduce.reduce.speculative instead.
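
For completeness, here is a minimal sketch of turning speculative execution off with the newer API, using the property names cited above (the Job setters are equivalent convenience methods):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Newer-API property names (the old API uses mapred.*.tasks.speculative.execution).
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "wordcount");

        // Equivalent convenience setters on the Job object.
        job.setMapSpeculativeExecution(false);
        job.setReduceSpeculativeExecution(false);
    }
}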