Hadoop streaming 'cat' and 'wc' example---how do the 'cat' mapper and 'wc' reducer actually work

Question

My question is as follows. Apache Hadoop, in its documentation, mentions the following example command for Hadoop streaming:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Now I feed a text file to this streaming job. Say the text file contains just the following two lines:

This is line1
It becomes line2

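The hadoop streaming command runs perfectly, without any problems.

But even though I have read the material linked above and other examples on the Internet several times, I cannot answer the following questions. Assume there is only one mapper and one reducer: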

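As I understand it, the mapper takes (key, value) pairs as input. In the case of the above two lines, what would be the key and what would be the value?
The mapper function is 'cat'. Does 'cat' act on the key part or on the value part of the mapper's input?
If the input is just the two lines above, what would the mapper's output be? What is the 'key' part and what is the 'value' part?
The reducer will get these (key, value) pairs. The reducer function is 'wc'. How does 'wc' know whether to act on the 'key' or on the 'value' of this tuple?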

I know these are very basic questions, but I keep getting stuck trying to find the right answers. Any help would be appreciated.

Thanks.

Answer 1

In the case of the above two lines what would be the key and what would be the value.

The key is the byte offset of the line within the input file. The value is the entire text of the line.
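To make the (offset, line) pairing concrete, here is a small local sketch. It does not involve Hadoop at all; it only uses `grep -b` to print the byte offset at which each line of the question's sample input starts. The variable name `records` is made up for illustration.

```shell
# Feed the two sample lines from the question through grep.
# -b prefixes each matching line with its byte offset in the input;
# the empty pattern '' matches every line.
records=$(printf 'This is line1\nIt becomes line2\n' | grep -b '')

# Each output line has the form offset:line, i.e. key:value.
echo "$records"
```

The first line starts at offset 0, and the second starts at offset 14, because "This is line1" plus its newline occupies 14 bytes.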

The mapper acts on both the key and the value.

I believe the mapper outputs each line unchanged, or at least as (null, line).

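wc will operate on each unique key, so if you get only one result as output, then the input was probably (null, ["this line one", "it becomes line2"]), and the list of values is what gets counted.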
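Under the question's assumption of one mapper and one reducer, the job can be roughly approximated on a local machine with an ordinary pipeline. This is only a sketch, not what Hadoop actually executes: `cat` stands in for the mapper, `sort` approximates the sort that happens between the map and reduce phases, and `wc` stands in for the reducer. The variable names `input`, `mapped`, and `shuffled` are made up for illustration.

```shell
# The two-line input from the question.
input='This is line1
It becomes line2'

# Mapper stand-in: cat copies its standard input to standard output unchanged.
mapped=$(printf '%s\n' "$input" | cat)

# Shuffle stand-in: records reach the reducer sorted by key.
# LC_ALL=C forces plain byte-order sorting.
shuffled=$(printf '%s\n' "$mapped" | LC_ALL=C sort)

# Reducer stand-in: wc counts lines, words, and bytes of whatever
# arrives on its standard input.
printf '%s\n' "$shuffled" | wc
```

Neither `cat` nor `wc` knows anything about keys or values; each just reads lines from standard input. Byte counts from a real cluster may differ from this local run, since streaming serializes key/value pairs with a tab separator.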
