在 Spark 中将纯文本文件转换为 Hadoop 序列文件

Question

我现有的项目正在使用 Hadoop map-reduce 生成具有自定义键和值的序列文件，格式为 XML。

XML 值是通过从输入源一次读取一行生成的，RecordReader 实现为 return 来自纯文本的 XML 格式的下一个值.

例如输入源文件有 3 行（第一行是 header，其余行有实际数据）

id|name|value
1|Vijay|1000
2|Gaurav|2000
3|Ashok|3000

Post map方法序列文件有如下数据：

FeedInstanceKey{feedInstanceId=1000, entity=bars}   <?xml version='1.0' encoding='UTF-8'?><bars><id>1</id><name>Vijay</name><value>1000</value></bars>
FeedInstanceKey{feedInstanceId=1000, entity=bars}   <?xml version='1.0' encoding='UTF-8'?><bars><id>2</id><name>Gaurav</name><value>2000</value></bars>
FeedInstanceKey{feedInstanceId=1000, entity=bars}   <?xml version='1.0' encoding='UTF-8'?><bars><id>3</id><name>Ashok</name><value>3000</value></bars>

问题：我希望在 Spark 中实现相同的功能。基本上，读取输入文件并生成键值对，如上所示。

是否有 way/possible 可以重用现有的 InputFormat 以及我的 Hadoop 映射器中使用的 RecordReader class。

RecordReader是responsible/having将纯文本行转换为XML的逻辑，return作为Hadoop map方法的值写入context.write()方法。

请多多指教。

Answer 1

External Datasets 部分的 Spark 文档对此进行了介绍。对您来说重要的部分是：

For other Hadoop InputFormats, you can use the JavaSparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use JavaSparkContext.newAPIHadoopRDD for InputFormats based on the “new” MapReduce API (org.apache.hadoop.mapreduce).

这是一个演示如何使用它的简单示例：

public final class ExampleSpark {

    public static void main(String[] args) throws Exception {
        JavaSparkContext spark = new JavaSparkContext();
        Configuration jobConf = new Configuration();

        JavaPairRDD<LongWritable, Text> inputRDD = spark.newAPIHadoopFile(args[0], TextInputFormat.class, LongWritable.class, Text.class, jobConf);
        System.out.println(inputRDD.count());

        spark.stop();
        System.exit(0);
    }
}

您可以查看 JavaSparkContext 的 Javadocs here。

在 Spark 中将纯文本文件转换为 Hadoop 序列文件

Convert plain text file to Hadoop sequence file in Spark

java

xml

hadoop

mapreduce

apache-spark