Spark Streaming wordcount using textFileStream doesn't print results

I am using files as a Spark stream and I want to count the words in the stream, but the application doesn't print anything. Here is my code. I am using Scala in a Cloudera environment.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object TwitterHashtagStreaming {

      def main(args: Array[String]): Unit = {

        val conf = new SparkConf()
          .setAppName("TwitterHashtagStreaming")
          .setMaster("local[2]")
          .set("spark.executor.memory", "1g")

        // Process the stream in 5-second batches.
        val streamingC = new StreamingContext(conf, Seconds(5))

        // Monitor the directory for new text files.
        val streamLines = streamingC.textFileStream("file:///home/cloudera/Desktop/wordstream")
        val words = streamLines.flatMap(_.split(" "))
        val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

        counts.print()

        streamingC.start()
        streamingC.awaitTermination()
      }
    }

Please read the documentation carefully:

def textFileStream(directory: String): DStream[String]

Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files (using key as LongWritable, value as Text and input format as TextInputFormat). Files must be written to the monitored directory by "moving" them from another location within the same file system. File names starting with . are ignored.

In short, textFileStream is a change detector: you must start your streaming application first, and only then move your data files into the monitored directory.
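For example, while the job above is running, you can stage a file somewhere else on the same filesystem and then move it into the monitored directory. Here is a minimal Scala sketch of that test procedure; the staging directory and file name are hypothetical, chosen only for illustration:

    import java.nio.file.{Files, Paths, StandardCopyOption}

    object FeedWordStream {
      def main(args: Array[String]): Unit = {
        // Write the data OUTSIDE the monitored directory first,
        // so the streaming job never sees a half-written file.
        // (The staging path is just an example.)
        val staged = Paths.get("/home/cloudera/Desktop/staging/batch1.txt")
        Files.createDirectories(staged.getParent)
        Files.write(staged, "hello spark hello streaming".getBytes("UTF-8"))

        // Then move it in: a rename within the same filesystem is atomic,
        // and the file appears in the directory as a complete new file.
        val target = Paths.get("/home/cloudera/Desktop/wordstream/batch1.txt")
        Files.createDirectories(target.getParent)
        Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE)
      }
    }

With the streaming job running, counts.print() should then show the word counts for that file within the next 5-second batch.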

This semantics mimics the "stream concept" of a real production deployment, where new data arrives incrementally over time, for example as network packets, just as your files do here.