MapReduce - retain input order
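I have a file containing a pipe-delimited list of numbers, which may include duplicates. I need to write a MapReduce program that lists the numbers without duplicates, in their original input order. I was able to remove the duplicates, but the input order is not retained.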
It's simple. Suppose your text is:
Line 1 -> On the top of the Crumpetty Tree
Line 2 -> The Quangle Wangle sat,
Line 3 -> But his face you could not see,
Line 4 -> On account of his Beaver Hat.
Line 5 -> But his face you could not see,
Line 6 -> The Quangle Wangle sat,
where lines 2 and 3 are repeated as lines 6 and 5.
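The mapper should be similar to the one in the wordcount program. With TextInputFormat, the mapper's input is key-value pairs of (byte offset of the line, line text); the offsets below are illustrative: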
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
(113, But his face you could not see,)
(146, The Quangle Wangle sat,)
Mapper output:
(NullWritable, 0_On the top of the Crumpetty Tree)
(NullWritable, 33_The Quangle Wangle sat,)
(NullWritable, 57_But his face you could not see,)
(NullWritable, 89_On account of his Beaver Hat.)
(NullWritable, 113_But his face you could not see,)
(NullWritable, 146_The Quangle Wangle sat,)
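The tagging step above can be sketched in plain Java. This is only an illustration with no Hadoop dependency: the class name `OffsetTagger` and its methods are hypothetical, and the offsets are computed by hand assuming ASCII text with a single `\n` after each line, whereas in a real job the framework would pass the offset in as the map input key.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the map step; not a Hadoop Mapper.
class OffsetTagger {
    // Mimics map(offset, line): the emitted value is "<offset>_<line>".
    static String tag(long offset, String line) {
        return offset + "_" + line;
    }

    // Computes an offset for each line, assuming ASCII text and
    // exactly one '\n' terminating every line.
    static List<String> tagAll(List<String> lines) {
        List<String> out = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            out.add(tag(offset, line));
            offset += line.length() + 1; // +1 for the newline
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "On the top of the Crumpetty Tree",
            "The Quangle Wangle sat,",
            "But his face you could not see,");
        tagAll(lines).forEach(System.out::println);
    }
}
```

In an actual Hadoop mapper the same concatenation would be written to the context with a NullWritable key, as in the mapper output listed above.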
Now make sure you have only one reducer, so that this single reducer receives the whole mapper output.
Reducer input:
Key: NullWritable
Iterable<value>: [(0_On the top of the Crumpetty Tree),
(33_The Quangle Wangle sat,),
(57_But his face you could not see,),
(89_On account of his Beaver Hat.),
(113_But his face you could not see,),
(146_The Quangle Wangle sat,)]
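Note that this answer relies on the values arriving at the reducer in ascending offset order, which is the original line order, because the offsets coming out of TextInputFormat always increase through a file. Hadoop only guarantees the sort order of keys, not the order of values under one key, so if the order matters in practice, sort the values by their numeric offset in the reducer (or emit the offset as the map output key) before removing duplicates.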
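If the arrival order of the values cannot be trusted, one option is to restore it explicitly from the offset prefix. The sketch below is plain Java, not Hadoop code, and `OffsetOrder` with its methods is a hypothetical name. It parses the offset as a number, because comparing the tagged strings directly would place "113_..." before "33_...".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: restores input order from "<offset>_<line>" values.
class OffsetOrder {
    // Parses the numeric offset that precedes the first '_'.
    static long offsetOf(String tagged) {
        return Long.parseLong(tagged.substring(0, tagged.indexOf('_')));
    }

    // Returns a copy of the values sorted by numeric offset.
    static List<String> sortByOffset(List<String> tagged) {
        List<String> sorted = new ArrayList<>(tagged);
        sorted.sort(Comparator.comparingLong(OffsetOrder::offsetOf));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> shuffled = List.of(
            "113_But his face you could not see,",
            "33_The Quangle Wangle sat,",
            "0_On the top of the Crumpetty Tree");
        sortByOffset(shuffled).forEach(System.out::println);
    }
}
```

This buffers every value in memory, which is acceptable only for inputs that fit on the single reducer.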
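In the reducer, just iterate over the list, discard duplicates, and write each line after stripping the leading offset and the _ delimiter. The reducer output looks like:
Reducer key-value: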
NullWritable, value.split("_", 2)[1]
Reducer output:
Line 1 -> On the top of the Crumpetty Tree
Line 2 -> The Quangle Wangle sat,
Line 3 -> But his face you could not see,
Line 4 -> On account of his Beaver Hat.
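The reducer logic described above (iterate, drop duplicates, strip the offset prefix) might look like the following in plain Java. It is a sketch without Hadoop types: `DedupReducer` is a hypothetical name, and it assumes the values are already in offset order. The split uses a limit of 2 so that a line containing its own underscore is not truncated.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the reduce step; not a Hadoop Reducer.
class DedupReducer {
    // Keeps the first occurrence of each line, in the order received.
    static List<String> reduce(Iterable<String> values) {
        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String value : values) {
            // Split only on the first '_' so the line text stays intact.
            String line = value.split("_", 2)[1];
            if (seen.add(line)) { // add() returns false for a duplicate
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = List.of(
            "0_On the top of the Crumpetty Tree",
            "33_The Quangle Wangle sat,",
            "57_But his face you could not see,",
            "89_On account of his Beaver Hat.",
            "113_But his face you could not see,",
            "146_The Quangle Wangle sat,");
        reduce(values).forEach(System.out::println);
    }
}
```

The set of seen lines lives in memory on the single reducer, so this approach suits inputs whose distinct lines fit there.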