How to save a BufferedImage RDD as HDFS file
I need to read images from HDFS, do some processing on them, and save them back to HDFS, and this processing has to be done in Spark. I read the image files with sc.binaryFiles, convert them to BufferedImage, and perform some operations.
But when I try to write the RDD[BufferedImage] to an FSDataOutputStream, I get a "Task not serializable" error.
//imports used in this REPL session (UtilImageIO is a third-party helper used for the local save; its import is not shown)
import javax.imageio.ImageIO
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

//read binary files into an RDD
val images = sc.binaryFiles("/tmp/images/")
//images: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = /tmp/images/
//get BufferedImageRDD
val bufImages = images.map(x => ImageIO.read(x._2.open))
//bufImages: org.apache.spark.rdd.RDD[java.awt.image.BufferedImage] = MapPartitionsRDD[1]
//try saving in local directory
bufImages.foreach(x => UtilImageIO.saveImage(x,"Mean3.jpg"))
//success
//try saving in hdfs
val conf = new Configuration()
val fileSystem = FileSystem.get(conf);
val out = fileSystem.create(new Path("/tmp/img1.png"));
//out: org.apache.hadoop.fs.FSDataOutputStream = org.apache.hadoop.hdfs.client.HdfsDataOutputStream@440f55ad
bufImages.foreach(x => ImageIO.write(x,"png", out))
The above code throws the following error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.RDD$$anonfun$foreach.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreach.apply(RDD.scala:925)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:925)
... 49 elided
Caused by: java.io.NotSerializableException: org.apache.hadoop.hdfs.client.HdfsDataOutputStream
Serialization stack:
- object not serializable (class: org.apache.hadoop.hdfs.client.HdfsDataOutputStream, value: org.apache.hadoop.hdfs.client.HdfsDataOutputStream@440f55ad)
- field (class: $iw, name: out, type: class org.apache.hadoop.fs.FSDataOutputStream)
- object (class $iw, $iw@13c2b782)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@28aedf6e)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@14d0c3ff)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@48eb05e9)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@6b9ba1a6)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@53d519cb)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@45d7e92)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@79c1301b)
- field (class: $line49.$read, name: $iw, type: class $iw)
- object (class $line49.$read, $line49.$read@1a714d1)
- field (class: $iw, name: $line49$read, type: class $line49.$read)
- object (class $iw, $iw@79ef07b3)
- field (class: $iw, name: $outer, type: class $iw)
- object (class $iw, $iw@2dd246ff)
- field (class: $anonfun, name: $outer, type: class $iw)
- object (class $anonfun, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
... 58 more
If there is a specific way to achieve this, please let me know.
The foreach method on an RDD only requires that whatever its closure captures be serializable. Here the closure captures the driver-side out stream, and its concrete type, HdfsDataOutputStream, is not serializable, which is exactly what the stack trace reports.
So by writing a wrapper around ImageIO.write(x, "png", out) that takes only serializable parameters, I was able to get this working.
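A minimal sketch of that wrapper idea follows, assuming each image is written to its own HDFS file; the saveToHdfs helper, the /tmp/out/ output paths, and the use of zipWithIndex are illustrative assumptions, not part of the original code. The point is that the FileSystem and FSDataOutputStream are created inside the closure, on the executor, so the task only captures serializable values (a path string and the image already held in the partition).

import java.awt.image.BufferedImage
import javax.imageio.ImageIO
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

//hypothetical helper: opens the HDFS stream on the worker and closes it after writing
def saveToHdfs(img: BufferedImage, pathStr: String): Unit = {
  val fs = FileSystem.get(new Configuration())
  val out = fs.create(new Path(pathStr))
  try ImageIO.write(img, "png", out)
  finally out.close()
}

//give each image its own output path; zipWithIndex is just one way to name the files
bufImages.zipWithIndex.foreach { case (img, idx) =>
  saveToHdfs(img, s"/tmp/out/img$idx.png")
}

If opening a stream per record is a concern, foreachPartition lets you set up the FileSystem once per partition and reuse it for every image in that partition.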