Output contents of DStream in Scala Apache Spark
The following Spark code does not appear to do anything with the file example.txt:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val conf = new org.apache.spark.SparkConf()
  .setMaster("local")
  .setAppName("filter")
  .setSparkHome("C:\\spark\\spark-1.2.1-bin-hadoop2.4")
  .set("spark.executor.memory", "2g")
val ssc = new StreamingContext(conf, Seconds(1))
val dataFile: DStream[String] = ssc.textFileStream("C:\\example.txt")
dataFile.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
I am trying to use dataFile.print() to print the first 10 elements of the file.
Some of the generated output:
15/03/12 12:23:53 INFO JobScheduler: Started JobScheduler
15/03/12 12:23:54 INFO FileInputDStream: Finding new files took 105 ms
15/03/12 12:23:54 INFO FileInputDStream: New files at time 1426163034000 ms:
15/03/12 12:23:54 INFO JobScheduler: Added jobs for time 1426163034000 ms
15/03/12 12:23:54 INFO JobScheduler: Starting job streaming job 1426163034000 ms.0 from job set of time 1426163034000 ms
-------------------------------------------
Time: 1426163034000 ms
-------------------------------------------
15/03/12 12:23:54 INFO JobScheduler: Finished job streaming job 1426163034000 ms.0 from job set of time 1426163034000 ms
15/03/12 12:23:54 INFO JobScheduler: Total delay: 0.157 s for time 1426163034000 ms (execution: 0.006 s)
15/03/12 12:23:54 INFO FileInputDStream: Cleared 0 old files that were older than 1426162974000 ms:
15/03/12 12:23:54 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()
15/03/12 12:23:55 INFO FileInputDStream: Finding new files took 2 ms
15/03/12 12:23:55 INFO FileInputDStream: New files at time 1426163035000 ms:
15/03/12 12:23:55 INFO JobScheduler: Added jobs for time 1426163035000 ms
15/03/12 12:23:55 INFO JobScheduler: Starting job streaming job 1426163035000 ms.0 from job set of time 1426163035000 ms
-------------------------------------------
Time: 1426163035000 ms
-------------------------------------------
15/03/12 12:23:55 INFO JobScheduler: Finished job streaming job 1426163035000 ms.0 from job set of time 1426163035000 ms
15/03/12 12:23:55 INFO JobScheduler: Total delay: 0.011 s for time 1426163035000 ms (execution: 0.001 s)
15/03/12 12:23:55 INFO MappedRDD: Removing RDD 1 from persistence list
15/03/12 12:23:55 INFO BlockManager: Removing RDD 1
15/03/12 12:23:55 INFO FileInputDStream: Cleared 0 old files that were older than 1426162975000 ms:
15/03/12 12:23:55 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()
15/03/12 12:23:56 INFO FileInputDStream: Finding new files took 3 ms
15/03/12 12:23:56 INFO FileInputDStream: New files at time 1426163036000 ms:
example.txt is in the format:
gdaeicjdcg,194,155,98,107
jhbcfbdigg,73,20,122,172
ahdjfgccgd,28,47,40,178
afeidjjcef,105,164,37,53
afeiccfdeg,29,197,128,85
aegddbbcii,58,126,89,28
fjfdbfaeid,80,89,180,82
As the documentation for print states:
/**
 * Print the first ten elements of each RDD generated in this DStream. This is an output
 * operator, so this DStream will be registered as an output stream and there materialized.
 */
Does this mean that 0 RDDs were generated for this stream? In Apache Spark, if you want to see the contents of an RDD you would use its collect() function. Is there a similar method for Streams? In short, how do I print the contents of a Stream to the console?
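To make this concrete, here is a rough, untested sketch of the kind of thing I am after, modeled on how collect() is used with an RDD (the batches here are tiny, so pulling them back to the driver seems acceptable):
// Untested sketch: foreachRDD exposes the RDD behind each batch, so its
// contents can be pulled back to the driver and printed, much like
// rdd.collect() in core Spark. Only sensible for small batches.
dataFile.foreachRDD { (rdd, time) =>
  println(s"Batch at $time contains ${rdd.count()} records")
  rdd.take(10).foreach(println)
}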
Update:
I have updated the code based on @0x0FFF's comment. http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html does not seem to give an example of reading from the local file system. Is this not as common as with Spark core, where there are explicit examples for reading data from a file?
Here is the updated code:
val conf = new org.apache.spark.SparkConf()
  .setMaster("local[2]")
  .setAppName("filter")
  .setSparkHome("C:\\spark\\spark-1.2.1-bin-hadoop2.4")
  .set("spark.executor.memory", "2g")
val ssc = new StreamingContext(conf, Seconds(1))
val dataFile: DStream[String] = ssc.textFileStream("file:///c:/data/")
dataFile.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
But the output is the same. When I add new files to the c:\data directory (in the same format as the existing data files), they are not processed. I assume dataFile.print should print the first 10 lines to the console?
Update 2:
Perhaps this is related to the fact that I am running this code in a Windows environment?
You are misunderstanding the use of textFileStream. Here is its description from the Spark documentation:
Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files (using key as LongWritable, value as Text and input format as TextInputFormat).
So, first, you should pass it a directory, and second, that directory should be available from the node running the receiver, so it is preferable to use HDFS for this purpose. Then, when you put a new file into that directory, it will be processed by print() and its first 10 lines will be printed.
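In Scala the intended usage looks roughly like this (the directory path is only an example; any HDFS or locally reachable directory that receives new files after the context has started should work):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("textFileStreamDemo")
val ssc = new StreamingContext(conf, Seconds(1))

// Pass a directory, not a single file; only files created in (or moved
// into) this directory after ssc.start() are picked up.
val lines = ssc.textFileStream("hdfs:///user/alex/incoming/")
lines.print()  // prints the first 10 elements of each batch

ssc.start()
ssc.awaitTermination()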
Update:
My code:
[alex@sparkdemo tmp]$ pyspark --master local[2]
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/12 06:37:49 WARN Utils: Your hostname, sparkdemo resolves to a loopback address: 127.0.0.1; using 192.168.208.133 instead (on interface eth0)
15/03/12 06:37:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.2.0
/_/
Using Python version 2.6.6 (r266:84292, Nov 22 2013 12:16:22)
SparkContext available as sc.
>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 30)
>>> dataFile = ssc.textFileStream('file:///tmp')
>>> dataFile.pprint()
>>> ssc.start()
>>> ssc.awaitTermination()
-------------------------------------------
Time: 2015-03-12 06:40:30
-------------------------------------------
-------------------------------------------
Time: 2015-03-12 06:41:00
-------------------------------------------
-------------------------------------------
Time: 2015-03-12 06:41:30
-------------------------------------------
1 2 3
4 5 6
7 8 9
-------------------------------------------
Time: 2015-03-12 06:42:00
-------------------------------------------
Here is the custom receiver I wrote to listen for data in a specified directory:
package receivers

import java.io.File

import org.apache.spark.{ SparkConf, Logging }
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.streaming.receiver.Receiver

class CustomReceiver(dir: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {

  def onStart() {
    // Start the thread that reads files from the monitored directory
    new Thread("File Receiver") {
      override def run() { receive() }
    }.start()
  }

  def onStop() {
    // There is nothing much to do as the thread calling receive()
    // is designed to stop by itself if isStopped() returns false
  }

  def recursiveListFiles(f: File): Array[File] = {
    val these = f.listFiles
    these ++ these.filter(_.isDirectory).flatMap(recursiveListFiles)
  }

  private def receive() {
    for (f <- recursiveListFiles(new File(dir))) {
      val source = scala.io.Source.fromFile(f)
      val lines = source.getLines
      store(lines)
      source.close()
      logInfo("Stopped receiving")
      restart("Trying to connect again")
    }
  }
}
One thing I think it is important to note is that a file needs to be processed within the configured batchDuration. In the example below it is set to 10 seconds, but if the receiver takes longer than 10 seconds to process a file, some data files will not be processed. I am open to correction on this point (the listener sketch after the snippet below shows one way to check).
The custom receiver is used like this:
import org.apache.spark.streaming.dstream.ReceiverInputDStream

val conf = new org.apache.spark.SparkConf()
  .setMaster("local[2]")
  .setAppName("filter")
  .setSparkHome("C:\\spark\\spark-1.2.1-bin-hadoop2.4")
  .set("spark.executor.memory", "2g")
val ssc = new StreamingContext(conf, Seconds(10))
val customReceiverStream: ReceiverInputDStream[String] = ssc.receiverStream(new CustomReceiver("C:\\data\\"))

customReceiverStream.print()
customReceiverStream.foreachRDD(m => {
  println("size is " + m.collect.size)
})

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
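One way to check whether a batch is actually exceeding the interval (a rough sketch, not something I have verified thoroughly) is to register a StreamingListener on the ssc above before calling ssc.start() and compare the reported delays against the batchDuration:
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Rough sketch: log how long each batch took so it can be compared with
// the configured batchDuration (Seconds(10) above).
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"Batch ${info.batchTime}: processing delay = ${info.processingDelay.getOrElse(-1L)} ms, " +
            s"total delay = ${info.totalDelay.getOrElse(-1L)} ms")
  }
})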
More info:
http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html and
https://spark.apache.org/docs/1.2.0/streaming-custom-receivers.html
I may have found your problem; you should have this in your logs:
WARN StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
The problem is that you need at least 2 cores to run a Spark Streaming application, so the solution should be to simply replace:
val conf = new org.apache.spark.SparkConf()
.setMaster("local")
with:
val conf = new org.apache.spark.SparkConf()
.setMaster("local[*]")
Or at least more than one core.