Not serializable result: org.apache.hadoop.io.IntWritable when reading Sequence File with Spark / Scala

Logically, I am reading a sequence file whose records contain an Int and a String.

If I do this:

val sequence_data = sc.sequenceFile("/seq_01/seq-directory/*", classOf[IntWritable], classOf[Text])
                  .map{case (x, y) => (x.toString(), y.toString().split("/")(0), y.toString().split("/")(1))}
                  .collect

this is fine, because the IntWritable is converted to a String before being collected.

But if I do this:

val sequence_data = sc.sequenceFile("/seq_01/seq-directory/*", classOf[IntWritable], classOf[Text])
                  .map{case (x, y) => (x, y.toString().split("/")(0), y.toString().split("/")(1))}
                  .collect 

then I immediately get this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.0 in stage 42.0 (TID 692) had a not serializable result: org.apache.hadoop.io.IntWritable

The root cause is not really clear to me: it is serialization, but why is it so difficult? This is yet another aspect of serialization I have run into. Also, it is only reported at run time.

If the goal is simply to get the integer value, then get() needs to be called on the Writable:

.map{case (x, y) => (x.get(), y.toString().split("/")(0), y.toString().split("/")(1))}
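
Putting it together, here is a minimal sketch of the corrected pipeline, assuming the same path and a spark-shell session where sc is already defined (the split is also done once as a small cleanup):

import org.apache.hadoop.io.{IntWritable, Text}

val sequence_data = sc.sequenceFile("/seq_01/seq-directory/*", classOf[IntWritable], classOf[Text])
                  .map{case (x, y) =>
                    // convert the Writables before the result leaves the executor
                    val parts = y.toString().split("/")
                    (x.get(), parts(0), parts(1))
                  }
                  .collect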

Then the JVM handles serialization of the Int value rather than not knowing how to process an IntWritable, which does not implement the Serializable interface. (collect has to serialize the task results to ship them from the executors back to the driver, which is why the failure only shows up at run time.)

String, on the other hand, does implement Serializable, which is why the first version works.
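
As an aside, Spark can also do the unwrapping for you: the typed overload sc.sequenceFile[Int, String] uses Spark's built-in WritableConverters to turn IntWritable and Text into Int and String before your map ever sees them. A sketch, assuming the same path:

val sequence_data = sc.sequenceFile[Int, String]("/seq_01/seq-directory/*")
                  .map{case (x, y) => (x, y.split("/")(0), y.split("/")(1))}
                  .collect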