Read Avro parquet file from inside JAR
I am trying to read a parquet file bundled as a resource inside a JAR, ideally as a stream.
Does anyone have a working example that doesn't involve writing the resource out as a temporary file first?
Here is the code I use to read the file, which works fine in my IDE before packaging it as a JAR:
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

import java.io.IOException;
import java.net.URISyntaxException;

try {
    Path path = new Path(classLoader.getResource(pattern_id).toURI());
    Configuration conf = new Configuration();
    try (ParquetReader<GenericRecord> r = AvroParquetReader.<GenericRecord>builder(
            HadoopInputFile.fromPath(path, conf))
            .disableCompatibility()
            .build()) {
        patternsFound.add(pattern_id);
        GenericRecord record;
        while ((record = r.read()) != null) {
            // Do some work
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
} catch (NullPointerException | URISyntaxException e) {
    e.printStackTrace();
}
When I run this code from the JAR file, I get this error:
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "jar"
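For context on why this fails: once the code runs from a JAR, classLoader.getResource(...) no longer returns a file: URL but a jar: URL, and Hadoop's FileSystem has no registered handler for the jar scheme. A small stdlib-only demonstration (the JAR and resource names here are made up for the example):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

public class JarUrlDemo {
    // Packs one resource entry into a temp JAR, looks it up through a
    // classloader, and returns the protocol of the resulting URL.
    static String resourceProtocol() throws Exception {
        File jar = File.createTempFile("demo", ".jar");
        jar.deleteOnExit();
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream(jar))) {
            out.putNextEntry(new JarEntry("data.bin"));
            out.write(new byte[] {1, 2, 3});
            out.closeEntry();
        }
        try (URLClassLoader cl = new URLClassLoader(new URL[] {jar.toURI().toURL()}, null)) {
            return cl.getResource("data.bin").getProtocol();
        }
    }

    public static void main(String[] args) throws Exception {
        // Resources inside a JAR come back as "jar:" URLs, a scheme
        // Hadoop's Path/FileSystem machinery cannot open.
        System.out.println(resourceProtocol()); // prints "jar"
    }
}
```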
I thought I could get around it with:

InputStream inputFile = classLoader.getResourceAsStream(pattern_id);

but I don't know how to make AvroParquetReader work with an input stream.
By adapting the solution here, I was eventually able to read the parquet file as a resource stream:
import org.apache.commons.io.IOUtils;
import org.apache.parquet.io.DelegatingSeekableInputStream;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.SeekableInputStream;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ParquetStream implements InputFile {
    private final String streamId;
    private final byte[] data;

    // ByteArrayInputStream keeps its read offset in the protected "pos"
    // field, so subclassing it is enough to make the stream seekable.
    private static class SeekableByteArrayInputStream extends ByteArrayInputStream {
        public SeekableByteArrayInputStream(byte[] buf) {
            super(buf);
        }

        public void setPos(int pos) {
            this.pos = pos;
        }

        public int getPos() {
            return this.pos;
        }
    }

    public ParquetStream(String streamId, InputStream stream) throws IOException {
        this.streamId = streamId;
        this.data = IOUtils.toByteArray(stream);
    }

    @Override
    public long getLength() {
        return this.data.length;
    }

    @Override
    public SeekableInputStream newStream() throws IOException {
        return new DelegatingSeekableInputStream(new SeekableByteArrayInputStream(this.data)) {
            @Override
            public void seek(long newPos) {
                ((SeekableByteArrayInputStream) this.getStream()).setPos((int) newPos);
            }

            @Override
            public long getPos() {
                return ((SeekableByteArrayInputStream) this.getStream()).getPos();
            }
        };
    }

    @Override
    public String toString() {
        return "ParquetStream[" + streamId + "]";
    }
}
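The key trick in ParquetStream is that ByteArrayInputStream already tracks its read position in the protected pos field, so a subclass only has to expose it to get random access. A minimal, JDK-only sketch of that idea in isolation (class and method names here are illustrative, not from any library):

```java
import java.io.ByteArrayInputStream;

public class SeekDemo {
    // Exposing ByteArrayInputStream's protected "pos" field gives us
    // seek() over an in-memory buffer for free.
    static class SeekableBytes extends ByteArrayInputStream {
        SeekableBytes(byte[] buf) {
            super(buf);
        }

        void seek(int newPos) {
            this.pos = newPos;
        }

        int position() {
            return this.pos;
        }
    }

    public static void main(String[] args) {
        SeekableBytes in = new SeekableBytes(new byte[] {10, 20, 30, 40});
        in.read();                         // consume one byte
        System.out.println(in.position()); // prints 1
        in.seek(3);                        // jump to offset 3
        System.out.println(in.read());     // prints 40
    }
}
```

This is also why the whole resource is buffered into a byte[] in the constructor: Parquet readers need to seek (the footer lives at the end of the file), and a raw resource InputStream only supports forward reads.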
Then I can do this:
InputStream in = classLoader.getResourceAsStream(pattern_id);
try {
    ParquetStream parquetStream = new ParquetStream(pattern_id, in);
    try (ParquetReader<GenericRecord> r = AvroParquetReader.<GenericRecord>builder(parquetStream)
            .disableCompatibility()
            .build()) {
        GenericRecord record;
        while ((record = r.read()) != null) {
            // Do some work
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}
Maybe this helps someone down the line, since I couldn't find any straightforward answer.