Spark job failed due to non-serializable objects
I am running a Spark job that generates HFiles for my HBase data store. It used to run fine on my Cloudera cluster, but when we switched to an EMR cluster it fails with the following stack trace:
Serialization stack:
- object not serializable (class: org.apache.hadoop.hbase.io.ImmutableBytesWritable, value: 50 31 36 31 32 37 30 33 34 5f 49 36 35 38 34 31 35 38 35); not retrying
Serialization stack:
- object not serializable (class: org.apache.hadoop.hbase.io.ImmutableBytesWritable, value: 50 31 36 31 32 37 30 33 34 5f 49 36 35 38 34 31 35 38 35)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage.apply(DAGScheduler.scala:1493)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage.apply(DAGScheduler.scala:1492)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1492)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed.apply(DAGScheduler.scala:803)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed.apply(DAGScheduler.scala:803)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:803)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1720)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1675)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1664)
at org.apache.spark.util.EventLoop$$anon.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:629)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset.apply$mcV$sp(PairRDDFunctions.scala:1158)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile.apply$mcV$sp(PairRDDFunctions.scala:1005)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile.apply(PairRDDFunctions.scala:996)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile.apply(PairRDDFunctions.scala:996)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:996)
at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopFile(JavaPairRDD.scala:823)
My questions:
- What is causing the difference between the two runs? A version difference between the two clusters?
- I did some research and found this post; I then added the Kryo parameters to my spark-submit command, which now looks like this:
spark-submit --conf spark.kryo.classesToRegister=org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.hbase.KeyValue --master yarn --deploy-mode client --driver-memory 16G --executor-memory 18G ...
However, I still get the same error.
Here is my Java code:
protected void generateHFilesUsingSpark(JavaRDD<Row> rdd) throws Exception {
    JavaPairRDD<ImmutableBytesWritable, KeyValue> javaPairRdd = rdd.mapToPair(
        new PairFunction<Row, ImmutableBytesWritable, KeyValue>() {
            public Tuple2<ImmutableBytesWritable, KeyValue> call(Row row) throws Exception {
                String key = (String) row.get(0);
                String value = (String) row.get(1);

                ImmutableBytesWritable rowKey = new ImmutableBytesWritable();
                byte[] rowKeyBytes = Bytes.toBytes(key);
                rowKey.set(rowKeyBytes);

                KeyValue keyValue = new KeyValue(rowKeyBytes,
                    Bytes.toBytes("COL"),
                    Bytes.toBytes("FM"),
                    ProductJoin.newBuilder()
                        .setId(key)
                        .setSolrJson(value)
                        .build().toByteArray());

                return new Tuple2<ImmutableBytesWritable, KeyValue>(rowKey, keyValue);
            }
        });

    Configuration baseConf = HBaseConfiguration.create();
    Configuration conf = new Configuration();
    conf.set(HBASE_ZOOKEEPER_QUORUM, "xxx.xxx.xx.xx");

    Job job = new Job(baseConf, "APP-NAME");
    HTable table = new HTable(conf, "hbaseTargetTable");

    Partitioner partitioner = new IntPartitioner(importerParams.shards);
    JavaPairRDD<ImmutableBytesWritable, KeyValue> repartitionedRdd =
        javaPairRdd.repartitionAndSortWithinPartitions(partitioner);

    HFileOutputFormat2.configureIncrementalLoad(job, table);
    System.out.println("Done configuring incremental load....");

    Configuration config = job.getConfiguration();
    repartitionedRdd.saveAsNewAPIHadoopFile(
        "hfilesOutputPath",
        ImmutableBytesWritable.class,
        KeyValue.class,
        HFileOutputFormat2.class,
        config);
    System.out.println("Saved HFiles to: " + importerParams.hfilesOutputPath);
}
OK, problem solved. The trick was to use the Kryo serializer; I added the following to my Java code to register ImmutableBytesWritable.
SparkSession.Builder builder = SparkSession.builder().appName("AWESOME");
// Switch Spark's serializer from the default Java serialization to Kryo.
builder.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

// Register the non-Serializable HBase key class with Kryo.
SparkConf conf = new SparkConf().setAppName("AWESOME");
Class<?>[] classes = new Class[]{org.apache.hadoop.hbase.io.ImmutableBytesWritable.class};
conf.registerKryoClasses(classes);

builder.config(conf);
SparkSession spark = builder.getOrCreate();
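For completeness, the same two settings should also work from the command line alone. My earlier spark-submit attempt most likely kept failing because spark.kryo.classesToRegister only takes effect once spark.serializer is actually switched to Kryo; with the default Java serialization still in place, shuffling the ImmutableBytesWritable keys is what threw the error. A sketch of the command with both flags (untested, reusing the options from the question):

spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.classesToRegister=org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.hbase.KeyValue \
  --master yarn --deploy-mode client --driver-memory 16G --executor-memory 18G ...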