How to deactivate output in Hadoop streaming?

Question

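I'm writing Python MapReduce programs on my cluster. My mapper parses the data and stores it in HBase. There is no reducer and no output.

Here is the code for reference, in case it's needed: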

import happybase

class Mapper:
  ...
  def __init__(...):
     ...

  def start(self, file):
    generator = self.read_input(file)
    connection = happybase.Connection(Mapper.IP)
    self.table = connection.table(Mapper.table_name)
    for line in generator:
      self.parse(line)
      self.write()
      self.buffers = []
    self.table = None
    connection.close()

  def read_input(self, file):
    ...
  def parse(self, line):
    ...
  def write(self):
    # write buffers into HBase
    for cell in self.buffers:
      self.table.put(cell[0], cell[1])  # <- Into HBase yay

My question is: if I run this command on my cluster:

bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
-D mapred.reduce.tasks=1 \
-file /home/hduser/mapper.py    -mapper /home/hduser/mapper.py \
-input /user/hduser/streamingTest/testFile.csv

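it says: Oops, ERROR streaming.StreamJob: Missing required option: output

Can I redirect the output to stdout, or deactivate it completely?

PS: I'm a sucky Python programmer, so please point out any code that makes you uncomfortable.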

Answer 1

You will need to specify some output. Since you don't want to output anything, use

NullOutputFormat

as follows:

-outputformat org.apache.hadoop.mapred.lib.NullOutputFormat
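
To show where the option goes in the full command, here is a minimal sketch in Python that only assembles the command line and prints it; it does not launch anything. The helper name build_streaming_command and the optional output_path argument are made up for illustration. Two assumptions to check against your Hadoop version: the sketch sets mapred.reduce.tasks=0 (a map-only job, since you said there is no reducer) rather than the 1 in your command, and some streaming versions may still insist on an -output argument even though NullOutputFormat writes nothing there, so the helper lets you pass one if needed.

```python
# Class name of the old-API (mapred) output format that discards all records.
NULL_OUTPUT_FORMAT = "org.apache.hadoop.mapred.lib.NullOutputFormat"


def build_streaming_command(streaming_jar, mapper_path, input_path, output_path=None):
    """Assemble the argument list for a map-only streaming job with no output files.

    Hypothetical helper: it only builds the list, it never runs Hadoop.
    """
    cmd = [
        "bin/hadoop", "jar", streaming_jar,
        # Generic options such as -D must come before the streaming options.
        "-D", "mapred.reduce.tasks=0",
        "-file", mapper_path,
        "-mapper", mapper_path,
        "-input", input_path,
        "-outputformat", NULL_OUTPUT_FORMAT,
    ]
    if output_path is not None:
        # Assumption: only needed if your version still demands -output.
        cmd += ["-output", output_path]
    return cmd


if __name__ == "__main__":
    command = build_streaming_command(
        "contrib/streaming/hadoop-*streaming*.jar",
        "/home/hduser/mapper.py",
        "/user/hduser/streamingTest/testFile.csv",
    )
    # Print a line that can be pasted into a shell, where the jar glob expands.
    print(" ".join(command))
```

The printed line is meant to be pasted into a shell from the Hadoop install directory, as in your original command.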

Tags: python, hbase, mapreduce, hadoop-streaming