How to use Spark Streaming to calculate the mean and variance of the past 25 days of streaming data from Kafka
There is streaming data in Kafka, a continuous stream of floating-point values:
2016-11-23 11:00:00|12.2
2016-11-23 11:03:00|13.2
2016-11-23 11:05:00|15.1
......
I want to calculate the mean and variance of these floats between 11:00 AM and 12:00 PM over the past 25 days.
Is Spark Streaming a good fit for this problem?
Thanks a lot!
@Ming, you can use this as a gist:
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions.avg
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setAppName("StreamCount")
// Update the batch interval according to your need
val ssc = new StreamingContext(sparkConf, Seconds(2))

// Create a direct Kafka stream with the given brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)

// Get the lines, parse the timestamp and the float value out of each one,
// and store the result in a DataFrame df. Note that a T-SQL filter such as
// BETWEEN DATEADD(...) AND DATEADD(...) will not run on Spark; its Spark SQL
// equivalent over the last 25 days would be:
//   SELECT float_number FROM readings
//   WHERE hour(event_time) = 11
//     AND event_time >= date_sub(current_timestamp(), 25)
df.select(avg($"float_number")).show()
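The gist above only computes the mean, while the question also asks for the variance. Below is a minimal sketch of both statistics in one aggregation, assuming the raw "timestamp|value" records are available as a Dataset[String]; the names lines, readings, event_time, and float_number are illustrative placeholders rather than anything from the original post.

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, current_timestamp, date_sub, hour, var_samp}

val spark = SparkSession.builder.appName("MeanVariance").getOrCreate()
import spark.implicits._

// Three records in the question's format, standing in for the Kafka feed
// (note these 2016 dates fall outside a literal "last 25 days" window).
val lines = Seq(
  "2016-11-23 11:00:00|12.2",
  "2016-11-23 11:03:00|13.2",
  "2016-11-23 11:05:00|15.1").toDS()

// Parse "yyyy-MM-dd HH:mm:ss|value" records into (event_time, float_number).
val readings = lines.map { line =>
  val Array(ts, value) = line.split('|')
  (Timestamp.valueOf(ts), value.toDouble)
}.toDF("event_time", "float_number")

// Keep only the 11:00-12:00 window of the last 25 days,
// then compute mean and sample variance in a single pass.
readings
  .filter(hour($"event_time") === 11 &&
          $"event_time" >= date_sub(current_timestamp(), 25))
  .agg(avg($"float_number").as("mean"),
       var_samp($"float_number").as("variance"))
  .show()

var_samp divides by n - 1; use var_pop instead if the population variance is what is wanted.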