Java: Read JSON from a file, convert to ORC and write to a file
I need to automate the JSON-to-ORC conversion process. I almost got there using Apache's ORC-tools package, except that JsonReader does not handle the Map type and throws an exception. So the following works, but does not handle the Map type.
Path hadoopInputPath = new Path(input);
try (RecordReader recordReader = new JsonReader(hadoopInputPath, schema, hadoopConf)) { // throws when schema contains Map type
    try (Writer writer = OrcFile.createWriter(new Path(output), OrcFile.writerOptions(hadoopConf).setSchema(schema))) {
        VectorizedRowBatch batch = schema.createRowBatch();
        while (recordReader.nextBatch(batch)) {
            writer.addRowBatch(batch);
        }
    }
}
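For reference, schema and hadoopConf above are an ORC TypeDescription and a Hadoop Configuration. A minimal setup might look like the following (the struct definition is only an illustrative example; it is the map column that makes JsonReader throw):

import org.apache.hadoop.conf.Configuration;
import org.apache.orc.TypeDescription;

Configuration hadoopConf = new Configuration();
// Example schema only -- the map<string,string> column is the part JsonReader rejects
TypeDescription schema = TypeDescription.fromString(
        "struct<name:string,age:int,attributes:map<string,string>>");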
So I started looking at doing the JSON-to-ORC conversion with Hive classes instead, which has the added advantage that in the future I could convert to other formats, such as AVRO, with only minor code changes. However, I am not sure what the best way to do this with the Hive classes is. Specifically, it is not clear how to write an HCatRecord to a file, as shown below.
HCatRecordSerDe hCatRecordSerDe = new HCatRecordSerDe();
SerDeUtils.initializeSerDe(hCatRecordSerDe, conf, tblProps, null);

OrcSerde orcSerde = new OrcSerde();
SerDeUtils.initializeSerDe(orcSerde, conf, tblProps, null);

// Serialize an HCatRecord into an ORC-friendly Writable
Writable orcOut = orcSerde.serialize(hCatRecord, hCatRecordSerDe.getObjectInspector());
assertNotNull(orcOut);
InputStream input = getClass().getClassLoader().getResourceAsStream("test.json.snappy");
SnappyCodec compressionCodec = new SnappyCodec();
try (CompressionInputStream inputStream = compressionCodec.createInputStream(input)) {
    LineReader lineReader = new LineReader(new InputStreamReader(inputStream, Charsets.UTF_8));
    String jsonLine = null;
    while ((jsonLine = lineReader.readLine()) != null) {
        Writable jsonWritable = new Text(jsonLine);
        DefaultHCatRecord hCatRecord = (DefaultHCatRecord) jsonSerDe.deserialize(jsonWritable);
        // TODO: Write ORC to file????
    }
}
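For concreteness, the kind of wiring I imagine being needed for that TODO is something like the untested sketch below, using Hive's OrcOutputFormat. The valueClass, the output path variable, and reusing tblProps are guesses on my part, not verified code:

// Untested sketch: get a Hive RecordWriter for ORC and feed it the serialized rows
// (classes from org.apache.hadoop.hive.ql.io.orc, org.apache.hadoop.hive.ql.exec, and org.apache.hadoop.mapred)
JobConf jobConf = new JobConf(conf);
FileSinkOperator.RecordWriter orcWriter = new OrcOutputFormat().getHiveRecordWriter(
        jobConf,
        new Path(output),       // hypothetical output path
        NullWritable.class,     // guessed valueClass
        false,                  // isCompressed
        tblProps,
        Reporter.NULL);
try {
    Writable orcRow = orcSerde.serialize(hCatRecord, hCatRecordSerDe.getObjectInspector());
    orcWriter.write(orcRow);
} finally {
    orcWriter.close(false);     // abort = false
}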
Any ideas on how to complete the above code, or simpler ways of doing JSON-to-ORC, would be greatly appreciated.
Based on cricket_007's suggestion, here is what I ended up doing using the Spark libraries:
Maven dependencies (with some exclusions to keep maven-duplicate-finder-plugin happy):
<properties>
    <dep.jackson.version>2.7.9</dep.jackson.version>
    <spark.version>2.2.0</spark.version>
    <scala.binary.version>2.11</scala.binary.version>
</properties>

<dependency>
    <groupId>com.fasterxml.jackson.module</groupId>
    <artifactId>jackson-module-scala_${scala.binary.version}</artifactId>
    <version>${dep.jackson.version}</version>
    <exclusions>
        <exclusion>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
        </exclusion>
    </exclusions>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <exclusions>
        <exclusion>
            <groupId>log4j</groupId>
            <artifactId>apache-log4j-extras</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </exclusion>
        <exclusion>
            <groupId>net.java.dev.jets3t</groupId>
            <artifactId>jets3t</artifactId>
        </exclusion>
        <exclusion>
            <groupId>com.google.code.findbugs</groupId>
            <artifactId>jsr305</artifactId>
        </exclusion>
        <exclusion>
            <groupId>stax</groupId>
            <artifactId>stax-api</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.objenesis</groupId>
            <artifactId>objenesis</artifactId>
        </exclusion>
    </exclusions>
</dependency>
Java code synopsis:
SparkConf sparkConf = new SparkConf()
        .setAppName("Converter Service")
        .setMaster("local[*]");

SparkSession sparkSession = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate();

// read input data
Dataset<Row> events = sparkSession.read()
        .format("json")
        .schema(inputConfig.getSchema()) // StructType describing input schema
        .load(inputFile.getPath());

// write data out
DataFrameWriter<Row> frameWriter = events
        .selectExpr(
                // useful if you want to change the schema before writing it to ORC, e.g. ["`col1` as `FirstName`", "`col2` as `LastName`"]
                JavaConversions.asScalaBuffer(outputSchema.getColumns()))
        .write()
        .options(ImmutableMap.of("compression", "zlib"))
        .format("orc");
frameWriter.save(outputUri.getPath());
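In the above, inputConfig.getSchema() and outputSchema.getColumns() are my own helpers; roughly, they boil down to something like the illustrative values below (column names made up). At least in my case, Spark's JSON source handled the MapType columns that the ORC-tools JsonReader rejected.

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Illustrative stand-ins for my config objects
StructType inputSchema = new StructType()
        .add("col1", DataTypes.StringType)
        .add("col2", DataTypes.StringType)
        .add("attributes", DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType));

List<String> outputColumns = Arrays.asList(
        "`col1` as `FirstName`",
        "`col2` as `LastName`");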
Hope this helps someone get started.