MapReduce: Writing a sequence file using Python [Streaming]

Question

I am trying to write a sequence file from a MapReduce job. I have done this successfully in Java, but I am not sure how to do it in Python.

Thanks!

Answer 1

Hadoop Streaming accepts the command option -outputformat.
To write the job output as sequence files, use -outputformat SequenceFileOutputFormat.

For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
    -mapper MapperClass \
    -reducer ReducerClass

By default, -inputformat and -outputformat are set to TextInputFormat and TextOutputFormat, respectively.
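
Since the question is about Python, here is a minimal sketch of what a streaming mapper script might look like. The file name mapper.py, the function name map_lines, and the word-count logic are all hypothetical and only for illustration. The script just writes tab-separated key/value lines to stdout; as far as I know, Streaming splits each line at the first tab, and the output format named by -outputformat then takes care of serializing the pairs, so the Python side does not need to know anything about the sequence file format.

```python
#!/usr/bin/env python3
# mapper.py (hypothetical name): a word-count style streaming mapper.
import sys


def map_lines(lines):
    """Yield one 'key<TAB>value' record per word in the input lines."""
    for line in lines:
        for word in line.split():
            # Hadoop Streaming treats the text before the first tab as the key
            # and the rest of the line as the value.
            yield f"{word}\t1"


if __name__ == "__main__":
    # Streaming feeds input records on stdin and collects records from stdout.
    for record in map_lines(sys.stdin):
        print(record)
```

A script like this would be passed to -mapper in place of MapperClass and shipped with the job (for instance with the -file option). The reducer can be written the same way, reading sorted key/value lines from stdin.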

hadoop

mapreduce

hadoop-streaming