IntelliJ 中 Flink WordCount 输出中的数字

Question

我正在学习 Apache Flink，并通过 Maven 将其集成到 IntelliJ 中。我尝试了 GitHub 中的这个 WordCount 示例： WordCount example from GitHub

我只是简单地调整了输入文本。

生成输出的代码的主要部分是：

DataStream<Tuple2<String, Integer>> counts =
                // split up the lines in pairs (2-tuples) containing: (word,1)
                text.flatMap(new Tokenizer())
                        // group by the tuple field "0" and sum up tuple field "1"
                        .keyBy(value -> value.f0)
                        .sum(1);

        // emit result
        if (params.has("output")) {
            counts.writeAsText(params.get("output"));
        } else {
            System.out.println("Printing result to stdout. Use --output to specify output path.");
            counts.print();
        }
        // execute program
        env.execute("Streaming WordCount");

我在 IntelliJ 中得到以下输出

4> (name,1)
4> (years,1)
3> (hello,1)
3> (twice,1)
5> (the,1)
2> (i,1)
6> (my,1)
2> (am,1)
6> (florian,1)
7> (old,1)
2> (thirteen,1)
6> (word,1)
8> (is,1)
8> (is,2)
6> (florian,2)
6> (written,1)

所以我有两个问题：

“$NUMBER>”符号代表什么？这些是我的 Apache Flink 集群的工作人员的 ID 吗？哪一行代码执行此操作以及如何在输出中删除它们？在文档中找不到它。
单词“florian”在输出中出现了两次，如在文本中一样。这是由于子任务被写入输出吗？所以每增加一个字数，新的字数就写入输出？是否可以汇总这些，以便仅写入最终计数？

我知道这些是非常基本的问题，但我是 Apache Flink 的新手，也是一般分布式处理框架的新手，但我很想学习它。所以提前谢谢！ :)

Answer 1

WordCoiunt 是流媒体中的“hello world”space。

NUMBER 表示重复次数
“florian”在您的输入中出现两次，第一次出现在 (florian,1) 中，第二次出现在 (florian,2) 中，如果您在输入中添加另一个“florian”，flink 将计数并显示(弗洛里安,3)

Answer 2

是的，“$NUMBER>”向您显示了哪一个并行工作进程产生了该行输出。这来自 PrintSink，它实际上并不打算用于生产。相反，如果您提供 --output 参数，则此示例将使用 writeAsText 写入文件，我不相信这些“$NUMBER>”前缀会出现在那里。或者您可以使用 StreamingFileSink 或 FileSink.
您正在运行以 STREAMING 执行模式运行此应用程序。当运行以这种方式连接时，Flink 无法知道它会看到多少输入——它被设计为能够永远运行连续。由于不可能等到“结束”才能生成一个单一的、最终的字数统计报告，相反，每个输入记录都会导致生成一个更新的输出记录。您可以改为运行此应用程序处于 BATCH 执行模式，假设它只提供有界输入，在这种情况下它将运行完成并仅报告最终字数。有关详细信息，请参阅 https://ci.apache.org/projects/flink/flink-docs-stable/dev/datastream_execution_mode.html。

IntelliJ 中 Flink WordCount 输出中的数字

Numbers in output of Flink WordCount in IntelliJ

java

apache-flink

flink-streaming