How to make hadoop snappy output file the same format as those generated by Spark

Question

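We are using Spark, and until now the output has been PSV files. To save space, we now want to compress the output. To do that, we are switching to saving the JavaRDD with SnappyCodec, like this: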

objectRDD.saveAsTextFile(rddOutputFolder, org.apache.hadoop.io.compress.SnappyCodec.class);

We then use Sqoop to import the output into a database. The whole process works fine.

We would also like to compress the PSV files previously generated in HDFS into Snappy format. This is the command we tried:

hadoop jar /usr/hdp/2.6.5.106-2/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.5.106-2.jar \
-Dmapred.output.compress=true -Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
-Dmapred.reduce.tasks=0 \
-input input-path \
-output output-path

The command runs fine. The problem is that Sqoop cannot parse the resulting Snappy output files.

When we inspect the generated files with a command like "hdfs dfs -text hdfs-file-name", the output looks like this, with an "index"-like field added to each line:

0       2019-05-02|AMRS||5072||||3540||MMPT|0|
41      2019-05-02|AMRS||5538|HK|51218||1000||Dummy|45276|
118     2019-05-02|AMRS||5448|US|51218|TRADING|2282|HFT|NCR|45119|

That is, an extra value such as "0", "41", or "118" is added at the start of each line. Note that the .snappy files generated by Spark do not have this extra field.

Any idea how to prevent this extra field from being inserted?

Thanks a lot!

Answer 1

These are not indexes but keys generated by TextInputFormat, as described in the Hadoop Streaming documentation:

The class you supply for the input format should return key/value pairs of Text class. If you do not specify an input format class, the TextInputFormat is used as the default. Since the TextInputFormat returns keys of LongWritable class, which are actually not part of the input data, the keys will be discarded; only the values will be piped to the streaming mapper.
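To make the excerpt concrete: TextInputFormat uses the byte offset at which each line starts as the key, and the line itself as the value. The following is only a local sketch of that behavior, not Hadoop code. The function name with_offsets and the sample rows are made up for illustration, and it assumes ASCII data with single-byte "\n" line endings.

```shell
#!/bin/sh
# Local simulation only: prefix each input line with its starting byte offset
# and a tab, mimicking the (key, value) records TextInputFormat produces.
# Assumes ASCII input and one-byte "\n" line endings.
with_offsets() {
  LC_ALL=C awk '{ printf "%d\t%s\n", off, $0; off += length($0) + 1 }'
}

# Hypothetical sample rows, not taken from the real data set.
printf 'a|b|\ncc|d|\n' | with_offsets
```

When a job has no mapper to drop the key, records shaped like these are what end up in the output files, which matches the extra leading field shown in the question.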

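And since you have not defined any mapper in your job, these key/value pairs are written straight to the file system. So, as the excerpt above hints, you need some kind of mapper that discards the keys. A quick-and-dirty way is to use something already available as a pass-through, such as the shell cat command: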

hadoop jar /usr/hdp/2.6.5.106-2/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.5.106-2.jar \
-Dmapred.output.compress=true -Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
-Dmapred.reduce.tasks=0 \
-mapper /bin/cat \
-input input-path \
-output output-path
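Before pointing Sqoop at the new files, it may be worth checking that the leading key is really gone. Below is a minimal sketch of such a check. The helper name has_offset_prefix is hypothetical, and the part-file name in the usage comment is an assumption about how the job names its output, so adjust it to whatever actually appears under output-path.

```shell
#!/bin/sh
# Exits 0 if any line on stdin starts with digits followed by a tab,
# which is how a leaked TextInputFormat key shows up in the output.
has_offset_prefix() {
  grep -Eq "^[0-9]+$(printf '\t')"
}

# Assumed usage against the job output (the part-file name is a guess):
#   hdfs dfs -text output-path/part-00000.snappy | head -n 100 | has_offset_prefix \
#     && echo "keys still present" || echo "looks clean"

# Local demonstration with made-up rows:
printf '0\t2019-05-02|AMRS|\n' | has_offset_prefix && echo "keys still present"
printf '2019-05-02|AMRS|\n' | has_offset_prefix || echo "looks clean"
```

The check only looks for a numeric field followed by a tab at the start of a line, so it would give a false positive on data whose first real column happens to look like that.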

hadoop

hdfs

hadoop-streaming

snappy