Can I get the maximal key of every RDD in a DStream?

I need to find the maximal key of every RDD, but when I use reduce() I get only the single largest element of the whole DStream. For example, for the stream below I want it to return (2,"b"), (2,"d"), (3,"f"), but reduce(max) gives me only (3,"f"). How can I get (2,"b"), (2,"d"), (3,"f")?

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PythonStreamingQueueStream")
ssc = StreamingContext(sc, 1)  # batch interval of 1 second
stream = ssc.queueStream([sc.parallelize(
    [(1, "a"), (2, "b"), (1, "c"), (2, "d"), (1, "e"), (3, "f")], 3)])

stream.reduce(max).pprint()
ssc.start()
ssc.stop(stopSparkContext=True, stopGraceFully=True)
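The result you are seeing can be reproduced without Spark: `reduce(max)` collapses the whole batch into one element, and Python compares `(key, value)` tuples lexicographically, key first. A minimal plain-Python sketch of what happens inside that single batch:

```python
from functools import reduce

# The whole stream is one batch, so reduce(max) returns one element.
# Tuples compare key-first, so (3, "f") wins.
single_batch = [(1, "a"), (2, "b"), (1, "c"), (2, "d"), (1, "e"), (3, "f")]
print(reduce(max, single_batch))  # (3, 'f')
```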

This:

stream = ssc.queueStream([sc.parallelize([(1,"a"), (2,"b"),(1,"c"),(2,"d"),
  (1,"e"),(3,"f")],3)])

creates a stream with only one batch, where the first and only batch has (at least) 3 partitions. Since reduce works within each batch, a one-batch stream can only produce a single result. I think what you want is:

stream = ssc.queueStream([
    sc.parallelize([(1,"a"), (2,"b")]),
    sc.parallelize([(1,"c"), (2,"d")]), 
    sc.parallelize([(1,"e"), (3,"f")]), 
])

This will give you the expected result:

stream.reduce(max).pprint()
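Because `DStream.reduce` applies the reducer within each batch independently, a three-batch stream yields three results, one per batch. A plain-Python sketch (no Spark required, batch contents taken from the example above) of that per-batch semantics:

```python
from functools import reduce

# Each batch is reduced on its own, producing one maximum per batch.
batches = [
    [(1, "a"), (2, "b")],
    [(1, "c"), (2, "d")],
    [(1, "e"), (3, "f")],
]
per_batch_max = [reduce(max, b) for b in batches]
print(per_batch_max)  # [(2, 'b'), (2, 'd'), (3, 'f')]
```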