Flink on Kubernetes with ZooKeeper HA: job 00000000000000000000000000000000 is not in state RUNNING but SCHEDULED instead. Aborting checkpoint
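When I deploy a Flink job on Kubernetes with ZooKeeper HA, I get the error below.
Our setup is a job cluster with one job pod and one task pod. We want the task to keep working when the job pod is deleted.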
job 00000000000000000000000000000000 is not in state RUNNING but SCHEDULED instead. Aborting checkpoint
Here is my configuration:
high-availability: zookeeper
high-availability.storageDir: file:///opt/flink/data/
high-availability.zookeeper.quorum: zk-0.zk-hs:2181,zk-1.zk-hs:2181,zk-2.zk-hs:2181
high-availability.zookeeper.client.acl: open
high-availability.zookeeper.path.root: /flinkha
high-availability.cluster-id: /flink-job-service-kpi-ofcwy
Here is the error log:
2020-06-19 12:56:02,254 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
2020-06-19 12:56:02,293 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 0 checkpoints in ZooKeeper.
2020-06-19 12:56:02,293 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 0 checkpoints from storage.
2020-06-19 12:56:02,312 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
2020-06-19 12:56:02,454 INFO org.apache.flink.runtime.jobmaster.JobManagerRunner - JobManager runner for job KPI service job (00000000000000000000000000000000) was granted leadership with session id 9644799b-29cf-4ec5-9e68-5e45261aefb2 at akka.tcp://flink@flink-job-service-kpi-ofcwy:35817/user/jobmanager_0.
2020-06-19 12:56:02,532 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2020-06-19 12:56:02,534 INFO org.apache.flink.runtime.jobmaster.JobMaster - Starting execution of job KPI service job (00000000000000000000000000000000) under job master id 9e685e45261aefb29644799b29cf4ec5.
2020-06-19 12:56:02,552 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job KPI service job (00000000000000000000000000000000) switched from state CREATED to RUNNING.
2020-06-19 12:56:02,575 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: KPI-Kafka-Consumer -> (Sink: Print to Std. Out, Filter -> KPI Query Map -> KPI Unwind -> KPI Custom Map -> KPI filter -> KPI Data Transformation -> Filter) (1/1) (6aeaf74d5a4ee58579e79fa1d3026535) switched from CREATED to SCHEDULED.
2020-06-19 12:56:02,618 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{4abf5ce93cd365168228b616bd80ed71}]
2020-06-19 12:56:02,634 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Process -> Flat Map (1/1) (4ac2344f71fb9b6beb4a42fe18cf77a2) switched from CREATED to SCHEDULED.
2020-06-19 12:56:02,636 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(60000), ProcessingTimeTrigger, DistinctCountAggregateFunction, PassThroughWindowFunction) -> Map (1/1) (1fbb13647621f5e48db6f7d750c32865) switched from CREATED to SCHEDULED.
2020-06-19 12:56:02,636 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat Map -> (Sink: Unnamed, Sink: Print to Std. Out) (1/1) (46396671fce9498171d03a31b1cee968) switched from CREATED to SCHEDULED.
2020-06-19 12:56:02,655 INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://flink@flink-job-service-kpi-ofcwy:35817/user/resourcemanager(82039211570997fc83bd52bafb394879)
2020-06-19 12:56:02,674 INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration
2020-06-19 12:56:02,677 INFO org.apache.flink.runtime.jobmaster.JobMaster - Registration at ResourceManager attempt 1 (timeout=100ms)
2020-06-19 12:56:02,692 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/00000000000000000000000000000000/job_manager_lock.
2020-06-19 12:56:02,693 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job manager 9e685e45261aefb29644799b29cf4ec5@akka.tcp://flink@flink-job-service-kpi-ofcwy:35817/user/jobmanager_0 for job 00000000000000000000000000000000.
2020-06-19 12:56:02,753 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job manager 9e685e45261aefb29644799b29cf4ec5@akka.tcp://flink@flink-job-service-kpi-ofcwy:35817/user/jobmanager_0 for job 00000000000000000000000000000000.
2020-06-19 12:56:02,775 INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: 82039211570997fc83bd52bafb394879.
2020-06-19 12:56:02,775 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Requesting new slot [SlotRequestId{4abf5ce93cd365168228b616bd80ed71}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2020-06-19 12:56:02,777 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 00000000000000000000000000000000 with allocation id dcc3d3f3537cd3f1032fe47a0aafe577.
2020-06-19 12:56:40,983 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: KPI-Kafka-Consumer -> (Sink: Print to Std. Out, Filter -> KPI Query Map -> KPI Unwind -> KPI Custom Map -> KPI filter -> KPI Data Transformation -> Filter) (1/1) of job 00000000000000000000000000000000 is not in state RUNNING but SCHEDULED instead. Aborting checkpoint.
2020-06-19 12:57:40,982 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: KPI-Kafka-Consumer -> (Sink: Print to Std. Out, Filter -> KPI Query Map -> KPI Unwind -> KPI Custom Map -> KPI filter -> KPI Data Transformation -> Filter) (1/1) of job 00000000000000000000000000000000 is not in state RUNNING but SCHEDULED instead. Aborting checkpoint.
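Solved by configuring the service. In HA mode the JobManager otherwise binds its RPC endpoint to a random port (35817 in the log above), which the Kubernetes service presumably did not expose, so no slot was ever offered and the tasks stayed in SCHEDULED. The following setting was missing: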
high-availability.jobmanager.port: 6070
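For the fixed port to help, the Kubernetes Service in front of the job pod also has to expose it. The manifest below is only a sketch of what that could look like. The Service name is taken from the hostname in the akka.tcp address in the log, and port 6070 comes from the setting above. The selector label and the port name are hypothetical and need to match your own deployment.

```yaml
apiVersion: v1
kind: Service
metadata:
  # Assumed to match the hostname seen in the log (akka.tcp://flink@flink-job-service-kpi-ofcwy:...)
  name: flink-job-service-kpi-ofcwy
spec:
  selector:
    # Hypothetical label; use whatever labels your job pod actually carries
    app: flink-job-service-kpi
  ports:
    # Must equal high-availability.jobmanager.port in flink-conf.yaml
    - name: ha-rpc
      port: 6070
      targetPort: 6070
```

Any other ports the task pod needs to reach on the job pod (for example the blob server port, if you pin it) would be added to the same list.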