Spark - Create HFile for one rowKey with multiple columns
JavaRDD<String> hbaseFile = jsc.textFile(HDFS_MASTER + HBASE_FILE);
JavaPairRDD<ImmutableBytesWritable, KeyValue> putJavaRDD = hbaseFile.mapToPair(line -> convertToKVCol1(line, COLUMN_AGE));
// sortByKey returns a new RDD rather than sorting in place, so capture the result
JavaPairRDD<ImmutableBytesWritable, KeyValue> sortedRDD = putJavaRDD.sortByKey(true);
sortedRDD.saveAsNewAPIHadoopFile(stagingFolder, ImmutableBytesWritable.class, KeyValue.class, HFileOutputFormat2.class, conf);

private static Tuple2<ImmutableBytesWritable, KeyValue> convertToKVCol1(String beanString, byte[] column) {
    InspurUserEntity inspurUserEntity = gson.fromJson(beanString, InspurUserEntity.class);
    String rowKey = inspurUserEntity.getDepartment_level1() + "_" + inspurUserEntity.getDepartment_level2() + "_" + inspurUserEntity.getId();
    return new Tuple2<>(new ImmutableBytesWritable(Bytes.toBytes(rowKey)),
            new KeyValue(Bytes.toBytes(rowKey), COLUMN_FAMILY, column, Bytes.toBytes(inspurUserEntity.getAge())));
}
The above is my code, and it only works for a single column per row key. Does anyone have an idea how to create an HFile with multiple columns for one row key?
You'd have to use an array instead of an ImmutableBytesWritable in your declaration.
You can create multiple Tuple2<ImmutableBytesWritable, KeyValue> entries for a single row, where the key stays the same and each KeyValue represents an individual cell value. Make sure your columns are sorted lexicographically as well. Then call saveAsNewAPIHadoopFile on the resulting JavaPairRDD<ImmutableBytesWritable, KeyValue>:
final JavaPairRDD<ImmutableBytesWritable, KeyValue> writables = myRdd.flatMapToPair(record -> {
    final List<Tuple2<ImmutableBytesWritable, KeyValue>> listToReturn = new ArrayList<>();
    // Add first column to the collection
    listToReturn.add(new Tuple2<>(
            new ImmutableBytesWritable(Bytes.toBytes(record.getRowKey())),
            new KeyValue(Bytes.toBytes(record.getRowKey()), Bytes.toBytes("CF"),
                    Bytes.toBytes("COL1"), System.currentTimeMillis(),
                    Bytes.toBytes(record.getCol1()))));
    // Add subsequent columns
    listToReturn.add(new Tuple2<>(
            new ImmutableBytesWritable(Bytes.toBytes(record.getRowKey())),
            new KeyValue(Bytes.toBytes(record.getRowKey()), Bytes.toBytes("CF"),
                    Bytes.toBytes("COL2"), System.currentTimeMillis(),
                    Bytes.toBytes(record.getCol2()))));
    // flatMapToPair expects an Iterator over the emitted pairs (Spark 2.x API)
    return listToReturn.iterator();
});
Note: this is a major pitfall. You must also add your columns to the RDD in lexicographical order. Essentially, the combination rowkey + column family + column qualifier must be sorted before you go ahead and write out the HFiles.
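As a minimal sketch (not part of the original answer), one way to enforce that ordering is to sort on a composite key built from rowkey + column family + qualifier, then map back to the plain row key before writing. It assumes the writables RDD from the example above and uses org.apache.hadoop.hbase.CellUtil and Bytes; the variable name sortedCells is illustrative, and the naive byte concatenation assumes row keys that are fixed-length or otherwise free of prefix ambiguity.

JavaPairRDD<ImmutableBytesWritable, KeyValue> sortedCells = writables
        .mapToPair(t -> {
            KeyValue kv = t._2();
            // Composite sort key: rowkey + family + qualifier, mirroring the
            // order in which HFileOutputFormat2 expects cells to arrive
            byte[] composite = Bytes.add(
                    CellUtil.cloneRow(kv),
                    CellUtil.cloneFamily(kv),
                    CellUtil.cloneQualifier(kv));
            return new Tuple2<>(new ImmutableBytesWritable(composite), kv);
        })
        .sortByKey(true)
        // Restore the plain row key as the output key expected by HFileOutputFormat2
        .mapToPair(t -> new Tuple2<>(
                new ImmutableBytesWritable(CellUtil.cloneRow(t._2())), t._2()));

sortedCells.saveAsNewAPIHadoopFile(stagingFolder, ImmutableBytesWritable.class,
        KeyValue.class, HFileOutputFormat2.class, conf);

On large datasets, JavaPairRDD.repartitionAndSortWithinPartitions with a partitioner aligned to the target region boundaries can achieve the same ordering while also controlling which HFile each cell lands in.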