Avoid writing files for empty partitions in Spark Streaming

Question

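I have a Spark Streaming job that reads data from Kafka partitions (one executor per partition).
I need to save the transformed values to HDFS, but I want to avoid creating empty files.
I tried using isEmpty, but it does not help when only some of the partitions are empty.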

P.S. Repartitioning is not an acceptable solution because of the performance degradation.

Answer 1

This code works only for a pair RDD.

Code for text output:

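  import org.apache.hadoop.io.{NullWritable, Text}
  import org.apache.hadoop.mapreduce.OutputFormat
  import org.apache.hadoop.mapreduce.lib.output.{LazyOutputFormat, TextOutputFormat}

  // Tell LazyOutputFormat which real output format to delegate to.
  val conf = ssc.sparkContext.hadoopConfiguration
  conf.setClass("mapreduce.output.lazyoutputformat.outputformat",
    classOf[TextOutputFormat[Text, NullWritable]],
    classOf[OutputFormat[Text, NullWritable]])

  // The underlying writer is created on the first record, so an empty partition produces no file.
  kafkaRdd.map(_.value -> NullWritable.get)
    .saveAsNewAPIHadoopFile(basePath,
      classOf[Text],
      classOf[NullWritable],
      classOf[LazyOutputFormat[Text, NullWritable]],
      conf)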

Code for Avro output:

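  import org.apache.avro.mapred.AvroKey
  import org.apache.avro.mapreduce.AvroKeyOutputFormat
  import org.apache.hadoop.io.NullWritable
  import org.apache.hadoop.mapreduce.OutputFormat
  import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat
  import org.apache.spark.rdd.RDD

  val avro: RDD[(AvroKey[MyEvent], NullWritable)] = ....
  val conf = ssc.sparkContext.hadoopConfiguration

  // AvroKeyOutputFormat reads the writer schema from the job configuration.
  conf.set("avro.schema.output.key", MyEvent.SCHEMA$.toString)
  // Tell LazyOutputFormat which real output format to delegate to.
  conf.setClass("mapreduce.output.lazyoutputformat.outputformat",
    classOf[AvroKeyOutputFormat[MyEvent]],
    classOf[OutputFormat[AvroKey[MyEvent], NullWritable]])

  avro.saveAsNewAPIHadoopFile(basePath,
    classOf[AvroKey[MyEvent]],
    classOf[NullWritable],
    classOf[LazyOutputFormat[AvroKey[MyEvent], NullWritable]],
    conf)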
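
Both variants rely on LazyOutputFormat creating the underlying record writer only when the first record arrives, so a partition with no records never opens an output file. The following is a minimal sketch of that idea in plain Scala, without Hadoop or Spark. The names LazyLineWriter, LazyLineWriterDemo and writePartition are hypothetical and used only for illustration; they are not part of any library.

```scala
import java.io.BufferedWriter
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Hadoop or Spark: it only mimics the
// "create the output on first write" behaviour of LazyOutputFormat.
final class LazyLineWriter(path: Path) extends AutoCloseable {
  // Stays empty until the first record is written.
  private var underlying: Option[BufferedWriter] = None

  def write(line: String): Unit = {
    val writer = underlying.getOrElse {
      // First record: only now is the file actually created.
      val created = Files.newBufferedWriter(path, StandardCharsets.UTF_8)
      underlying = Some(created)
      created
    }
    writer.write(line)
    writer.newLine()
  }

  // Closing a writer that never received a record touches no file.
  def close(): Unit = underlying.foreach(_.close())
}

object LazyLineWriterDemo {
  // Writes one "partition" and reports whether a file exists at `path` afterwards.
  def writePartition(path: Path, records: Iterator[String]): Boolean = {
    val writer = new LazyLineWriter(path)
    try records.foreach(writer.write)
    finally writer.close()
    Files.exists(path)
  }
}
```

Calling writePartition with an empty iterator leaves no file behind, while a non-empty iterator produces one file with one line per record. This mirrors what happens per task in the Spark job above, without any repartitioning.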

hadoop

apache-kafka

apache-spark

spark-streaming

spark-streaming-kafka