Declare mrjob mapper without ignoring key

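I want to declare a mapper function with mrjob. Because my mapper needs to refer to some constants for its calculations, I decided to put these constants into the mapper's key (is there another way?). I read the mrjob tutorial, but all of its examples ignore the key. For example: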

from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        # the key is ignored here (named "_")
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

Basically, I want something like this:

def mapper(self, (constant1, constant2, constant3, constant4, constant5), line):
    # my calculation goes here

Please suggest how I can do this. Thanks.

You can set the constants in __init__:

from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, key, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1
        yield "Constant",self.constant

    def reducer(self, key, values):
        yield key, sum(values)

    def __init__(self,*args,**kwargs):
        super(MRWordFrequencyCount, self).__init__(*args, **kwargs)
        self.constant = 10


if __name__ == '__main__':
    MRWordFrequencyCount.run()

Output:

"Constant"  10
"chars" 12
"lines" 1
"words" 2
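
Note that in the job above the mapper yields the constant once per input line and the reducer sums it, so the "Constant" row equals 10 only for a single-line input. The following plain-Python sketch does not use mrjob at all; `simulate_job` is a hypothetical helper written for illustration that imitates the map, group and reduce steps to show the effect:

```python
from collections import defaultdict

CONSTANT = 10  # same value as self.constant in the job above


def mapper(line):
    # Mirrors the mapper above: three counts plus the constant, per line.
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1
    yield "Constant", CONSTANT


def simulate_job(lines):
    # Hypothetical stand-in for the shuffle and reduce steps:
    # group values by key, then sum each group like the reducer does.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


if __name__ == "__main__":
    print(simulate_job(["hello world!"]))
    print(simulate_job(["hello world!", "foo bar"]))
```

If the constant is only needed inside the calculation, it does not have to be yielded at all; the mapper can simply read self.constant.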

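Alternatively, you can use RawProtocol as the input protocol, so that the mapper receives the part of each input line before the first tab as its key: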

from mrjob.job import MRJob
from mrjob.protocol import RawProtocol


class MRWordFrequencyCount(MRJob):
    # Read each input line as a tab-separated (key, value) pair
    INPUT_PROTOCOL = RawProtocol

    def mapper(self, key, line):
        yield "constant", key
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        if str(key) != "constant":
            yield key, sum(values)
        else:
            yield "constant",list(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()

If the input is (with a tab between the constants and the text):

constant1,constant2,constant3   The quick brown fox jumps over the lazy dog

Output:

"chars" 43
"constant"  ["constant1,constant2,constant3"]
"lines" 1
"words" 9
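
To get separate constants rather than one comma-joined string, the key can be split inside the mapper. The sketch below is plain Python and does not call mrjob; `split_raw_line` only imitates the tab split that RawProtocol is understood to perform on each input line, and both function names are hypothetical, chosen for illustration:

```python
def split_raw_line(raw_line):
    # Imitates the key/value split on the first tab character.
    key, _, value = raw_line.partition("\t")
    return key, value


def parse_constants(key):
    # Turn "constant1,constant2,constant3" into a list of strings.
    return [part.strip() for part in key.split(",")]


if __name__ == "__main__":
    raw = "constant1,constant2,constant3\tThe quick brown fox jumps over the lazy dog"
    key, line = split_raw_line(raw)
    constant1, constant2, constant3 = parse_constants(key)
    print(constant1, constant2, constant3)
    print(len(line), len(line.split()))
```

Inside the mrjob mapper, the same split could be applied to the key argument. The parts are strings, so numeric constants would still need converting with int() or float() before use.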