Why does Spark Streaming stop working when I send two input streams?

I am developing a Spark Streaming application in which I need to use input streams from two servers in Python, each of which sends one JSON message per second to the Spark context.

My problem is that if I perform operations on only one of the streams, everything works fine. But if I have two streams from different servers, Spark freezes before it can print anything, and it only starts working again once both servers have sent all the JSON messages they had to send (when it detects that socketTextStream is not receiving data).

Here is my code:

    JavaReceiverInputDStream<String> streamData1 = ssc.socketTextStream("localhost", 996,
            StorageLevels.MEMORY_AND_DISK_SER);

    JavaReceiverInputDStream<String> streamData2 = ssc.socketTextStream("localhost", 9995,
            StorageLevels.MEMORY_AND_DISK_SER);

    JavaPairDStream<Integer, String> dataStream1 = streamData1.mapToPair(new PairFunction<String, Integer, String>() {
        public Tuple2<Integer, String> call(String stream) throws Exception {
            return new Tuple2<Integer, String>(1, stream);
        }
    });

    JavaPairDStream<Integer, String> dataStream2 = streamData2.mapToPair(new PairFunction<String, Integer, String>() {
        public Tuple2<Integer, String> call(String stream) throws Exception {
            return new Tuple2<Integer, String>(2, stream);
        }
    });

    dataStream2.print(); // for example

Note that there is no error message: Spark simply freezes after starting the context, and even while the JSON messages are arriving on the ports it does not display anything.

Many thanks.

Have a look at these notes from the Spark Streaming documentation and see whether they apply; a sketch of the resulting fix follows the quoted points:

Points to remember

  • When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. Either of these means that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).
  • Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it.
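If that is the case here, the fix is to give the local master more threads than you have receivers. Below is a minimal sketch of how the context could be created; the one-second batch interval and the class/app name (TwoStreamsApp) are assumptions, while the ports, storage level and transformations are the ones from your question:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.StorageLevels;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class TwoStreamsApp {
        public static void main(String[] args) throws Exception {
            // Each socket receiver occupies one thread for the lifetime of the job,
            // so a local master needs at least (number of receivers + 1) threads:
            // "local" or "local[2]" would leave nothing free to process the batches.
            SparkConf conf = new SparkConf()
                    .setAppName("TwoStreamsApp")   // hypothetical app name
                    .setMaster("local[3]");        // 2 receivers + 1 processing thread

            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

            JavaReceiverInputDStream<String> streamData1 =
                    ssc.socketTextStream("localhost", 996, StorageLevels.MEMORY_AND_DISK_SER);
            JavaReceiverInputDStream<String> streamData2 =
                    ssc.socketTextStream("localhost", 9995, StorageLevels.MEMORY_AND_DISK_SER);

            // ... your mapToPair transformations exactly as in the question ...
            streamData2.print(); // for example

            ssc.start();
            ssc.awaitTermination();
        }
    }

With two socket receivers, "local[3]" reserves one thread per receiver and still leaves one for processing; "local[*]" (one thread per logical core) usually works as well. The same reasoning applies on a cluster: allocate more cores to the application than the number of receivers, otherwise data is received but never processed.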