How to get the average number of words in a text in mrjob?

Question

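I have a simple problem with the mrjob MapReduce framework: I want to get the average number of words per line in a given paragraph. This is what I have so far: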

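from mrjob.job import MRJob


class LineAverage(MRJob):

    def mapper(self, _, line):
        # Emit the word count of this line, plus a 1 to count the line itself.
        numwords = len(line.split())
        yield "words", numwords
        yield "lines", 1

    def reducer(self, key, values):
        # Sum per key: total words for "words", total lines for "lines".
        yield key, sum(values)


if __name__ == '__main__':
    LineAverage.run()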

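With this code, after the reduce step I get the number of lines and the total number of words in the text, but I don't know how to get the average from them, that is: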

words/TotalOfLines

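I am new to this programming model, so I would appreciate it if someone could walk through this example.

Thanks in advance for your attention and help.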

Answer 1

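Your reducer already writes each key together with sum(values) to the output file, so the output contains both the total word count and the line count. You only need to read that output file into a Java/Scala program (or any other small script) and divide one total by the other.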
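The post-processing step described above could be sketched in Python rather than Java/Scala. This is only an illustration: the helper name `average_from_output` is hypothetical, and it assumes the job wrote its results in mrjob's default output format, where each line is a JSON-encoded key and a JSON-encoded value separated by a tab.

```python
import json


def average_from_output(lines):
    """Compute words per line from 'key<TAB>value' output lines.

    Assumption: key and value are JSON-encoded and tab-separated,
    e.g. '"words"\t12'.
    """
    totals = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        key_json, value_json = raw.split("\t", 1)
        totals[json.loads(key_json)] = json.loads(value_json)
    # "words" and "lines" are the two keys emitted by the question's job.
    return totals["words"] / float(totals["lines"])


# Hypothetical job output for a text with 3 lines and 12 words in total.
sample_output = ['"lines"\t3\n', '"words"\t12\n']
average = average_from_output(sample_output)
print(average)
```

In practice the lines would come from the job's output file instead of the hard-coded `sample_output` list.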

Answer 2

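In the end the answer was simple: I was already sending the reducer exactly as many values as there are lines. So in the reducer I only need to count the values that arrive for the key.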

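from mrjob.job import MRJob


class LineAverage(MRJob):

    def mapper(self, _, line):
        # One ("words", n) pair per input line.
        numwords = len(line.split())
        yield "words", numwords

    def reducer(self, key, values):
        # The number of values equals the number of lines.
        total_lines, total_words = 0, 0
        for count in values:
            total_lines += 1
            total_words += count
        yield "avg", total_words / float(total_lines)


if __name__ == '__main__':
    LineAverage.run()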

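So the mapper emits one pair ("words", x) per line, and the shuffle step groups them into a single entry ("words": x1, x2, x3, ..., x_numberOfLines), which is the reducer's input. Counting the values for that key gives the number of lines, and summing them gives the total number of words; nothing more is needed.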
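To make the data flow concrete, here is a minimal sketch that imitates the map, shuffle and reduce steps in plain Python, without mrjob. The `shuffle` function and the sample text are made up for illustration; a real framework performs the grouping itself, and this is not how mrjob implements it internally.

```python
from collections import defaultdict


def mapper(line):
    # Same logic as the job's mapper: one ("words", n) pair per line.
    yield "words", len(line.split())


def shuffle(pairs):
    # Stand-in for the framework's shuffle: group all values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    # Count the values (lines) and sum them (words) in a single pass.
    total_lines, total_words = 0, 0
    for count in values:
        total_lines += 1
        total_words += count
    yield "avg", total_words / float(total_lines)


text = ["the quick brown fox", "jumps over", "the lazy dog"]
pairs = [pair for line in text for pair in mapper(line)]
grouped = shuffle(pairs)
results = [out for key, values in grouped.items() for out in reducer(key, values)]
print(grouped["words"])
print(results)
```

Here the reducer receives the values 4, 2 and 3 under the single key "words", so it sees three lines and nine words.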

Hope this helps someone.

average

mapreduce

mrjob