火花流中两种类型的联合之间有什么不同

Question

Dstream 提供两种类型 union :

StreamingContext.union(Dstreams)

Dstream.union(anotherDstream)

所以我想知道有什么不同，尤其是在并行性能方面。

Answer 1

看这两个操作的源代码，除了一个以单个DStream作为输入，一个以列表为输入外，没有区别。

StreamingContext:

def union[T: ClassTag](streams: Seq[DStream[T]]): DStream[T] = withScope {
  new UnionDStream[T](streams.toArray)
}

Dstream:

def union(that: DStream[T]): DStream[T] = ssc.withScope {
  new UnionDStream[T](Array(this, that))
}

因此，您使用哪个取决于您的喜好，没有任何性能提升。当你有一个要合并的流列表时，StreamingConext 中的方法稍微简化了代码，因此，在这种情况下它可能更可取。

Answer 2

您的说法“DStream 提供两种类型的联合”不太正确。

ref 提到了不同的签名，更具体地说是提供联合操作的不同 classes。

StreamingContext.union(*dstreams)

Create a unified DStream from multiple DStreams of the same type and same slide duration.

DStream.union(other)

Return a new DStream by unifying data of another DStream with this DStream. Parameters: other – Another DStream having the same interval (i.e., slideDuration) as this DStream.

后面在Spark User List讨论："The union function simply returns a DStream with the elements from both. This is the same behavior as when we call union on RDDs".

StreamingContext的源代码：

def union(self, *dstreams):
    ...
    first = dstreams[0]
    jrest = [d._jdstream for d in dstreams[1:]]
    return DStream(self._jssc.union(first._jdstream, jrest), self, first._jrdd_deserializer)

DStream的源代码：

def union(self, other):
    return self.transformWith(lambda a, b: a.union(b), other, True)

你可以看到第一个使用递归（正如预期的那样），而另一个使用transformWith，它在相同的class中定义并转换每个RDD。

要记住的是Level of Parallelism in Data Receiving，如果数据接收成为系统瓶颈，那么考虑并行化数据接收过程是个好主意。

因此，鼓励将 union() 方法应用于多个 DStreams 的过程`，从而提供了一种方法来轻松执行此操作，同时保持代码整洁。恕我直言，性能不会有差异。

火花流中两种类型的联合之间有什么不同

Is there any different between two types of union in spark streaming

performance

distributed-computing

data-stream

apache-spark

spark-streaming