Why do Reduce input records differ from Reduce output records?

Question

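I am trying out MapReduce in Python with the dumbo library. Below is my experimental test code; I expect every record passed from the mapper to the reducer to show up in the reducer's output.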

import dumbo

count = 0  # module-level counter used by the reducer

def mapper(key, value):
    fields = value.split("\t")
    myword = fields[0] + "\t" + fields[1]
    yield myword, value

def reducer(key, values):
    for value in values:
        mypid = value
        words = value.split("\t")
    global count
    count = count + 1
    myword = str(count) + "--" + words[1]  # count total lines in the reducer's output records
    yield myword, 1

if __name__ == "__main__":
    dumbo.run(mapper, reducer)

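Below are the Map-Reduce Framework counters from the job log. I expect "Reduce input records" to equal "Reduce output records", but it does not. Is something wrong with my test code, or have I misunderstood something about MapReduce? Thanks.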

    Map-Reduce Framework
            Map input records=405057
            Map output records=405057
            Map output bytes=107178919
            Map output materialized bytes=108467155
            Input split bytes=2496
            Combine input records=0
            Combine output records=0
            Reduce input groups=63096
            Reduce shuffle bytes=108467155
            Reduce input records=405057
            Reduce output records=63096
            Spilled Records=810114

The reducer, modified as follows:

def reducer(key, values):
    global count
    for value in values:
        mypid = value
        words = value.split("\t")

        count = count + 1
        myword = str(count) + "--" + words[1]  # count total lines in the reducer's output records
        yield myword, 1

Answer 1

I expect "Reduce input records" to equal "Reduce output records", but it does not.

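I'm not sure why you would expect that. The whole point of a reducer is that it receives one group of values at a time (grouped by the key the mapper emitted), and your reducer emits only one record per group (yield myword, 1). So the only way your "Reduce input records" could equal your "Reduce output records" is if every group contained exactly one record, that is, if the first two fields of every value were unique across your record set. Since that is clearly not the case, your reducer emits fewer records than it receives. The counters bear this out: "Reduce input groups" is 63096, exactly the number of "Reduce output records".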
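To make the counting concrete, here is a minimal local sketch that imitates the shuffle with a sort and itertools.groupby instead of running Hadoop or dumbo. The helper run_local, the names reducer_per_group and reducer_per_value, and the five sample lines are all made up for illustration; the two reducers only mirror where the yield sits in the original and the modified reducer from the question.

```python
from itertools import groupby

def mapper(key, value):
    # Same keying as in the question: the first two tab-separated fields.
    fields = value.split("\t")
    yield fields[0] + "\t" + fields[1], value

def reducer_per_group(key, values):
    # Mirrors the original reducer: the yield sits outside the loop,
    # so exactly one record is emitted per key group.
    words = None
    for value in values:
        words = value.split("\t")
    yield words[1], 1

def reducer_per_value(key, values):
    # Mirrors the modified reducer: the yield sits inside the loop,
    # so one record is emitted per input value.
    for value in values:
        yield value.split("\t")[1], 1

def run_local(lines, mapper, reducer):
    """Hypothetical stand-in for the shuffle: sort by key, group, reduce."""
    mapped = [kv for i, line in enumerate(lines) for kv in mapper(i, line)]
    mapped.sort(key=lambda kv: kv[0])
    out = []
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        out.extend(reducer(key, (v for _, v in group)))
    return len(mapped), out

# Made-up sample data: 5 records that fall into 3 distinct key groups.
lines = ["a\tx\t1", "a\tx\t2", "a\ty\t3", "b\tx\t4", "b\tx\t5"]
n_in, per_group = run_local(lines, mapper, reducer_per_group)
_, per_value = run_local(lines, mapper, reducer_per_value)
print(n_in, len(per_group), len(per_value))
```

With this sample, five records reach the reducer in three groups, so the first reducer produces three output records and the second produces five, the same relationship as between "Reduce input groups" and "Reduce output records" in the log.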

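(In fact, this is the usual pattern; it is why it is called a "reducer". The name comes from 'reduce' in functional languages, which reduces a collection of values to a single value.)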
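For comparison, Python's own functools.reduce does the same thing at small scale. The sample list is arbitrary and only illustrates a collection being folded into one value:

```python
from functools import reduce
import operator

values = [3, 1, 4, 1, 5]

# Fold the whole collection into a single value by repeated addition.
total = reduce(operator.add, values)
print(total)  # 14
```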
