How to extract key,value pairs from hbase SequenceFile using mapreduce?
I used the HBase Export utility tool to export an HBase table to HDFS as a SequenceFile.
Now I want to process that file with a MapReduce job:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapSequencefile {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value,
                Mapper<LongWritable, Text, Text, Text>.Context context)
                throws IOException, InterruptedException {
            System.out.println(key + "...." + value);
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, MapSequencefile.class.getSimpleName());
        job.setJarByClass(MapSequencefile.class);
        job.setNumReduceTasks(0);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setInputFormatClass(SequenceFileInputFormat.class); // use SequenceFileInputFormat
        FileInputFormat.setInputPaths(job, "hdfs://192.16.31.10:8020/input/");
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.16.31.10:8020/out/"));
        job.waitForCompletion(true);
    }
}
But it always throws this exception:
Caused by: java.io.IOException: Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1964)
at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1811)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1760)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1774)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:54)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
What can I do to fix this error?
I assume you are using this to do the export:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
as described on this HBase page: http://hbase.apache.org/0.94/book/ops_mgt.html#export
Looking at the source code of org.apache.hadoop.hbase.mapreduce.Export, you can see that it sets:
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Result.class);
which is consistent with your error (the value is a Result object):
Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'
So your map signature needs to change to:
Mapper<ImmutableBytesWritable, Result, Text, Text>
and you need to include the correct HBase libraries in your project so that it can access:
org.apache.hadoop.hbase.client.Result
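Putting that together, a corrected job might look like the sketch below. Two points are assumptions about your setup rather than something visible in your post: the `Bytes.toString(...)`/`value.toString()` output format is purely illustrative, and the `ResultSerialization` registration in `io.serializations` applies to newer HBase versions (0.96+), where `Result` is no longer a `Writable` and the SequenceFile reader needs that serializer registered to deserialize it; on 0.94 the classpath fix alone may be enough.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.ResultSerialization;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapSequencefile {

    // Input types must match what Export wrote:
    // key = ImmutableBytesWritable (the row key), value = Result (the row's cells)
    public static class MyMapper
            extends Mapper<ImmutableBytesWritable, Result, Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
                throws IOException, InterruptedException {
            // Illustrative only: emit the row key and the Result's string form
            context.write(new Text(Bytes.toString(key.get())),
                          new Text(value.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On HBase 0.96+, register HBase's Result (de)serializer alongside the
        // defaults so the SequenceFile reader can deserialize Result values
        conf.setStrings("io.serializations",
                conf.get("io.serializations"),
                ResultSerialization.class.getName());

        Job job = Job.getInstance(conf, MapSequencefile.class.getSimpleName());
        job.setJarByClass(MapSequencefile.class);
        job.setNumReduceTasks(0);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(job, "hdfs://192.16.31.10:8020/input/");
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.16.31.10:8020/out/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You will need the HBase client and server/mapreduce jars on the job's classpath (for example via `HADOOP_CLASSPATH=$(hbase classpath)`) so that `Result`, `ImmutableBytesWritable`, and `ResultSerialization` resolve at runtime.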