Distinct elements across DStreams
I am working with windowed DStreams in which the RDDs contain the following keys:
a,b,c
b,c,d
c,d,e
d,e,f
I want to get only the distinct keys across the entire stream:
a,b,c,d,e,f
How can this be done in Spark Streaming?
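For illustration, such a stream can be simulated locally with Spark's queueStream test utility. This is a minimal sketch; the local master, the 1-second batch interval, and the queue contents are assumptions made for testing:

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("distinct-keys")
val ssc = new StreamingContext(conf, Seconds(1))

// one RDD per 1-second batch, reproducing the overlapping keys above
val batches = mutable.Queue(
  ssc.sparkContext.parallelize(Seq("a", "b", "c")),
  ssc.sparkContext.parallelize(Seq("b", "c", "d")),
  ssc.sparkContext.parallelize(Seq("c", "d", "e")),
  ssc.sparkContext.parallelize(Seq("d", "e", "f"))
)
val dstream = ssc.queueStream(batches, oneAtATime = true)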
We can keep a count of the "last recently seen" keys over a window of 4 batch intervals and use it to remove the duplicates of the current interval. Something like this:
// original dstream
val dstream = ???
// deduplicate within each interval and pair each key with a 1 for counting
val keyedDstream = dstream.transform(rdd => rdd.distinct).map(e => (e, 1))
// keep a sliding window of 4 intervals with the count of the distinct keys seen
val windowed = keyedDstream.reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(4), Seconds(1))
// join the windowed counts with the keyed dstream of the current interval
val joined = keyedDstream.join(windowed)
// the unique keys through the window are those with a windowed count of 1
// (i.e. seen only in the current interval)
val uniquesThroughWindow = joined.transform { rdd =>
  rdd.collect { case (k, (current, windowCount)) if windowCount == 1 => k }
}
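With dstream bound to the simulated stream above (instead of ???), the job can be started and inspected per batch. Under those assumptions the first batch emits a,b,c and each later batch emits only its newly appearing key, so the union over the stream is a,b,c,d,e,f:

// print the per-batch unique keys: (a,b,c), then d, then e, then f
uniquesThroughWindow.print()
ssc.start()
ssc.awaitTermination()

Note that the windowed count is 1 exactly when the current batch holds a key's only occurrence within the last 4 intervals, so the deduplication "memory" is bounded by the window length: a key reappearing more than 4 intervals after it was last seen would be emitted again.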