How to implement a BOUNDED source for Flink's batch execution mode?

Question

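I'm trying to run a Flink (1.12.1) batch job with the following steps:

A custom SourceFunction that connects to MongoDB
Some flatMap and map operations to transform the data
A sink that writes the result to another MongoDB

I tried to run it in a StreamExecutionEnvironment with RuntimeExecutionMode.BATCH, but the application throws an exception because my source is detected as UNBOUNDED, and I can't find a way to mark it as BOUNDED (it should finish once it has collected all the documents in the Mongo collection).

The exception: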

    Exception in thread "main" java.lang.IllegalStateException: Detected an UNBOUNDED source with the 'execution.runtime-mode' set to 'BATCH'. This combination is not allowed, please set the 'execution.runtime-mode' to STREAMING or AUTOMATIC
        at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
        at org.apache.flink.streaming.api.graph.StreamGraphGenerator.shouldExecuteInBatchMode(StreamGraphGenerator.java:335)
        at org.apache.flink.streaming.api.graph.StreamGraphGenerator.generate(StreamGraphGenerator.java:258)
        at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.getStreamGraph(StreamExecutionEnvironment.java:1958)
        at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.getStreamGraph(StreamExecutionEnvironment.java:1943)
        at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1782)
        at com.grupotsk.bigdata.matadatapmexporter.MetadataPMExporter.main(MetadataPMExporter.java:33)
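To make the stack trace easier to read, here is a minimal sketch of the kind of check that `StreamGraphGenerator.shouldExecuteInBatchMode` appears to perform, judging from the error message. The types `Boundedness`, `RuntimeMode` and `ModeCheck` below are simplified stand-ins written for this illustration; they are not Flink's classes, and the real implementation may differ in detail.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified stand-ins for illustration only; these are NOT Flink's classes. */
enum Boundedness { BOUNDED, CONTINUOUS_UNBOUNDED }

enum RuntimeMode { STREAMING, BATCH, AUTOMATIC }

final class ModeCheck {

    /** Returns true if the job would run in batch mode; throws for BATCH plus an unbounded source. */
    static boolean shouldExecuteInBatchMode(RuntimeMode mode, List<Boundedness> sources) {
        boolean allBounded = sources.stream().allMatch(b -> b == Boundedness.BOUNDED);
        if (mode == RuntimeMode.BATCH && !allBounded) {
            throw new IllegalStateException(
                    "Detected an UNBOUNDED source with the 'execution.runtime-mode' set to 'BATCH'.");
        }
        // AUTOMATIC chooses batch execution only when every source is bounded.
        return mode == RuntimeMode.BATCH || (mode == RuntimeMode.AUTOMATIC && allBounded);
    }

    public static void main(String[] args) {
        // A source added with addSource(...) is treated as unbounded in this model.
        List<Boundedness> sources = Arrays.asList(Boundedness.CONTINUOUS_UNBOUNDED);
        System.out.println(shouldExecuteInBatchMode(RuntimeMode.AUTOMATIC, sources)); // prints "false"
        try {
            shouldExecuteInBatchMode(RuntimeMode.BATCH, sources);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this model the decision depends only on the boundedness each source declares, not on whether its `run` method happens to return, which would explain why a `SourceFunction` that finishes on its own is still rejected.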

Some code:

Execution environment:

public static StreamExecutionEnvironment getBatch() {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setRuntimeMode(RuntimeExecutionMode.BATCH);
    
    env.addSource(new MongoSource()).print();
    
    return env;
    
}

Mongo source:

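import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.bson.Document;

public class MongoSource extends RichSourceFunction<Document> {

    private static final long serialVersionUID = 8321722349907219802L;
    private MongoClient mongoClient;
    private MongoCollection<Document> mc;

    @Override
    public void open(Configuration con) {
        // Connect once per parallel instance, before run() is called.
        mongoClient = new MongoClient(
                new MongoClientURI("mongodb://localhost:27017/database"));

        mc = mongoClient.getDatabase("database").getCollection("collection");
    }

    @Override
    public void run(SourceContext<Document> ctx) throws Exception {
        // Emit every document in the collection, then stop.
        MongoCursor<Document> itr = mc.find(Document.class).cursor();
        while (itr.hasNext())
            ctx.collect(itr.next());
        this.cancel();
    }

    @Override
    public void cancel() {
        mongoClient.close();
    }
}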

Thanks!

Answer 1

Sources used with RuntimeExecutionMode.BATCH must implement Source rather than SourceFunction. Sinks should implement Sink rather than SinkFunction.

See Integrating Flink into your ecosystem - How to build a Flink connector from scratch for an introduction to these new interfaces. They are described in FLIP-27: Refactor Source Interface and FLIP-143: Unified Sink API.
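As a rough illustration of what "bounded" means in the new interface: a FLIP-27 `Source` declares `Boundedness.BOUNDED` from `getBoundedness()`, and its `SourceReader` returns `InputStatus.END_OF_INPUT` from `pollNext` once it has nothing left to read. The sketch below models only that end-of-input contract in plain Java. `PollStatus`, `BoundedIteratorReader` and `BoundedReaderDemo` are hypothetical names made up for this example, and a real connector would also need splits, a split enumerator and serializers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Stand-in for the status a reader reports after each poll; not a Flink class. */
enum PollStatus { MORE_AVAILABLE, END_OF_INPUT }

/** Hypothetical bounded reader: drains a finite iterator (for example a Mongo cursor). */
final class BoundedIteratorReader<T> {
    private final Iterator<T> records;

    BoundedIteratorReader(Iterator<T> records) {
        this.records = records;
    }

    /** Emits at most one record per call and reports whether the input is exhausted. */
    PollStatus pollNext(Consumer<T> output) {
        if (records.hasNext()) {
            output.accept(records.next());
            return PollStatus.MORE_AVAILABLE;
        }
        return PollStatus.END_OF_INPUT;
    }
}

final class BoundedReaderDemo {

    /** Plays the role of the runtime: polls until the reader signals the end of input. */
    static <T> List<T> drain(BoundedIteratorReader<T> reader) {
        List<T> out = new ArrayList<>();
        while (reader.pollNext(out::add) == PollStatus.MORE_AVAILABLE) {
            // Keep polling; each call may append one record to out.
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("doc-1", "doc-2", "doc-3");
        System.out.println(drain(new BoundedIteratorReader<String>(docs.iterator())));
        // prints "[doc-1, doc-2, doc-3]"
    }
}
```

With the real API, such a source would be attached with `env.fromSource(source, WatermarkStrategy.noWatermarks(), "name")` instead of `env.addSource(...)`.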

java

mongodb

apache-flink