Spark Streaming 收到警告 "replicated to only 0 peer(s) instead of 1 peers"

Question

我使用 Spark Streaming 接收来自 Twitter 的推文。我收到很多警告说：

replicated to only 0 peer(s) instead of 1 peers

此警告的用途是什么？

我的代码是：

    SparkConf conf = new SparkConf().setAppName("Test");
    JavaStreamingContext sc = new JavaStreamingContext(conf, Durations.seconds(5));
    sc.checkpoint("/home/arman/Desktop/checkpoint");

    ConfigurationBuilder cb = new ConfigurationBuilder();
    cb.setOAuthConsumerKey("****************")
        .setOAuthConsumerSecret("**************")
        .setOAuthAccessToken("*********************")
        .setOAuthAccessTokenSecret("***************");


    JavaReceiverInputDStream<twitter4j.Status> statuses = TwitterUtils.createStream(sc, 
            AuthorizationFactory.getInstance(cb.build()));

    JavaPairDStream<String, Long> hashtags = statuses.flatMapToPair(new GetHashtags());
    JavaPairDStream<String, Long> hashtagsCount = hashtags.updateStateByKey(new UpdateReduce());
    hashtagsCount.foreachRDD(new saveText(args[0], true));

    sc.start();
    sc.awaitTerminationOrTimeout(Long.parseLong(args[1]));
    sc.stop();

Answer 1

当使用 Spark Streaming 读取数据时，由于容错，传入的数据块至少被复制到另一个 node/worker。如果没有它，可能会发生运行time 从流中读取数据然后失败的情况，这部分数据将会丢失（它已经从流中读取并删除，并且由于失败它也在工作端丢失） .

参考Spark documentation：

While a Spark Streaming driver program is running, the system receives data from various sources and and divides it into batches. Each batch of data is treated as an RDD, that is, an immutable parallel collection of data. These input RDDs are saved in memory and replicated to two nodes for fault-tolerance.

您的警告意味着根本没有复制来自流的传入数据。原因可能是您运行应用程序只有一个 Spark worker 实例或运行在本地模式下运行。尝试启动更多的 Spark worker，看看警告是否消失。

Spark Streaming 收到警告 "replicated to only 0 peer(s) instead of 1 peers"

Spark Streaming get warn "replicated to only 0 peer(s) instead of 1 peers"

java

streaming

twitter4j

apache-spark

spark-streaming