Not serializable result: org.apache.hadoop.io.IntWritable when reading Sequence File with Spark / Scala
The logic reads in a Sequence File containing an Int and a String.
If I do this:
val sequence_data = sc.sequenceFile("/seq_01/seq-directory/*", classOf[IntWritable], classOf[Text])
.map{case (x, y) => (x.toString(), y.toString().split("/")(0), y.toString().split("/")(1))}
.collect
This is fine, because the IntWritable is converted to a String.
But if I do this instead:
val sequence_data = sc.sequenceFile("/seq_01/seq-directory/*", classOf[IntWritable], classOf[Text])
.map{case (x, y) => (x, y.toString().split("/")(0), y.toString().split("/")(1))}
.collect
then I immediately get this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.0 in stage 42.0 (TID 692) had a not serializable result: org.apache.hadoop.io.IntWritable
The root cause isn't really clear to me. Serialization, sure, but why does it have to be this hard? This is yet another serialization quirk I've run into, and on top of that it is only reported at runtime.
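For reference, a quick spark-shell check (just a sketch using the standard Hadoop and JDK classes) shows that IntWritable is a Hadoop Writable but not java.io.Serializable, which is presumably what the default Java serializer chokes on when collect ships the task results back to the driver:

import org.apache.hadoop.io.{IntWritable, Writable}

// IntWritable follows Hadoop's own Writable serialization contract...
println(classOf[Writable].isAssignableFrom(classOf[IntWritable]))             // true
// ...but it does not implement java.io.Serializable, which Java serialization requires
println(classOf[java.io.Serializable].isAssignableFrom(classOf[IntWritable])) // false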
If the goal is just to get the integer value, you need to call get on the Writable:
.map{case (x, y) => (x.get(), y.toString().split("/")(0), y.toString().split("/")(1))}
The JVM then handles serialization of the Integer object, instead of not knowing how to process an IntWritable, which does not implement the Serializable interface.
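Putting that together, here is a complete sketch of the fixed job (the path and the "first/second" value layout are copied from the question; it assumes a spark-shell session where sc already exists):

import org.apache.hadoop.io.{IntWritable, Text}

// Convert both Writables to plain Scala values inside the map, before collect
// ships the results to the driver with the default Java serializer.
val sequence_data = sc.sequenceFile("/seq_01/seq-directory/*", classOf[IntWritable], classOf[Text])
  .map { case (x, y) =>
    val parts = y.toString.split("/")   // same "first/second" layout as in the question
    (x.get(), parts(0), parts(1))       // x.get() returns an Int, which serializes fine
  }
  .collect

If I remember the API correctly, Spark's built-in WritableConverters also let you ask for Scala types directly, e.g. sc.sequenceFile[Int, String]("/seq_01/seq-directory/*"), which performs the same Writable-to-value conversion for you.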