1. Configuration

Add the following to flink-conf.yaml:

high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/recovery
high-availability.zookeeper.quorum: xxx
high-availability.zookeeper.path.root: /flink-ha
 
yarn.application-attempts: 2

Notes:
1) In YARN mode there is no need to set high-availability.cluster-id; it is only required in standalone mode.
2) yarn.application-attempts defaults to 1 when HA is disabled and to 2 when HA is enabled. Without HA, setting yarn.application-attempts greater than 1 causes the job to be restarted from an old checkpoint, which is a fairly dangerous operation, so it is best not to set yarn.application-attempts when HA is disabled.
3) The value of yarn.application-attempts is capped by YARN's yarn.resourcemanager.am.max-attempts setting (see the sketch below).
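For reference, a minimal yarn-site.xml sketch of the YARN-side cap; the value shown is the YARN default, so it has to be raised on the ResourceManager side if more AM attempts are needed:

<!-- yarn-site.xml: maximum ApplicationMaster attempts the ResourceManager allows -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>2</value> <!-- YARN default; yarn.application-attempts cannot exceed this value -->
</property>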

2. Failure drills

Without HA, if the NodeManager or the JobManager process dies, the job dies with it; there is no automatic failure recovery.
All of the failure drills below were run with HA enabled.

(1) Kill the JobManager process

The JobManager and TaskManager are automatically moved to new nodes and restarted, and the job resumes from the latest checkpoint.
The whole process is very fast; the failover is barely noticeable.

(2) Kill the NodeManager process on the JobManager's machine

The JobManager process is not affected and the job keeps running normally.
After ten-odd minutes, the JobManager and TaskManager are automatically moved to new nodes and restarted, resuming from the latest checkpoint.

ResourceManager log:

2022-05-18 11:30:39,662 INFO  util.AbstractLivelinessMonitor (AbstractLivelinessMonitor.java:run(148)) - Expired:hdfs08-dev.yingzi.com:45454 Timed out after 600 secs
2022-05-18 11:30:39,666 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:deactivateNode(1077)) - Deactivating Node hdfs08-dev.yingzi.com:45454 as it is now LOST
2022-05-18 11:30:39,667 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:handle(671)) - hdfs08-dev.yingzi.com:45454 Node Transitioned from RUNNING to LOST
2022-05-18 11:30:39,675 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e142_1652757164854_0006_11_000001 Container Transitioned from RUNNING to KILLED
2022-05-18 11:30:39,680 INFO  capacity.CapacityScheduler (CapacityScheduler.java:removeNode(1960)) - Removed node hdfs08-dev.yingzi.com:45454 clusterResource: <memory:143360, vCores:224>

By default it takes 10 minutes (controlled by yarn.nm.liveness-monitor.expiry-interval-ms) for the NodeManager to be marked LOST, after which the node and its containers are removed.

If the NodeManager is started again by hand within this window, the JobManager and TaskManager are not moved.

Without HA, the job would simply die after these 10 minutes.
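For reference, a minimal yarn-site.xml sketch of the NodeManager liveness timeout mentioned above; the value shown is the YARN default of 10 minutes, in milliseconds:

<!-- yarn-site.xml: how long the ResourceManager waits for NodeManager heartbeats before marking the node LOST -->
<property>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <value>600000</value> <!-- YARN default: 10 minutes -->
</property>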

(3) Kill the NodeManager process on the JobManager's machine first, then kill the JobManager process

At first the TaskManager process stays alive, but the job can no longer run; after 5 minutes the TaskManager process dies as well.
The TaskManager log shows that this is mainly caused by the lost connection to the JobManager:

2022-05-18 11:37:15,056 INFO  org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Remove job b5a0c61ab1b1495dd9c2f8e45d349957 from job leader monitoring.
2022-05-18 11:37:15,056 INFO  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2022-05-18 11:37:15,056 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/b5a0c61ab1b1495dd9c2f8e45d349957/job_manager_lock'}.
2022-05-18 11:37:15,056 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Close JobManager connection for job b5a0c61ab1b1495dd9c2f8e45d349957.
2022-05-18 11:37:15,079 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Close ResourceManager connection 27c66057868b763b46426beea8666578.
2022-05-18 11:37:25,014 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Cannot find task to fail for execution fc957691a7fbc7f3ac92f904b7840190 with exception:
java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.jobmaster.JobMasterGateway.updateTaskExecutionState(org.apache.flink.runtime.taskmanager.TaskExecutionState) timed out.

...

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@hdfs05-dev.yingzi.com:16897/user/rpc/jobmanager_2#-1091134734]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.

...

2022-05-18 11:42:15,096 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner      [] - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.

After ten-odd minutes, the JobManager and TaskManager are automatically moved to new nodes and restarted, resuming from the latest checkpoint, and the job returns to normal.

ResourceManager log:

2022-05-17 17:06:08,398 INFO  util.AbstractLivelinessMonitor (AbstractLivelinessMonitor.java:run(148)) - Expired:hdfs03-dev.yingzi.com:45454 Timed out after 600 secs
2022-05-17 17:06:08,399 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:deactivateNode(1077)) - Deactivating Node hdfs03-dev.yingzi.com:45454 as it is now LOST
2022-05-17 17:06:08,399 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:handle(671)) - hdfs03-dev.yingzi.com:45454 Node Transitioned from RUNNING to LOST

Again, by default it takes 10 minutes (controlled by yarn.nm.liveness-monitor.expiry-interval-ms) for the NodeManager to be marked LOST, after which the node and its containers are removed.

If either yarn.nm.liveness-monitor.expiry-interval-ms or yarn.am.liveness-monitor.expiry-interval-ms (both default to 10 minutes) is lowered, the JobManager and TaskManager are restarted sooner.
In other words, when the NodeManager and JobManager die at the same time, the job recovery time is determined jointly by the NM and AM liveness timeouts; see the sketch below.
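A minimal yarn-site.xml sketch of lowering both timeouts. The 1-minute value is purely illustrative; before using anything like it in production, the side effects (e.g. healthy nodes being marked LOST during long GC pauses or brief network hiccups) should be evaluated first:

<!-- yarn-site.xml: illustrative values, not a recommendation -->
<property>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <value>60000</value> <!-- 1 minute instead of the 10-minute default -->
</property>
<property>
  <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
  <value>60000</value> <!-- 1 minute instead of the 10-minute default -->
</property>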

If the NodeManager is started again by hand within this window, the job also recovers automatically.

(4) Kill the TaskManager process

The TaskManager is moved to a new node and restarted, but the job cannot run normally and checkpoints fail.
The logs show the JobManager still trying to reach the TaskManager on the old node, logging a connection refused exception every 10 seconds.
Only after about 10 minutes (related to the heartbeat.timeout value) does the log show:

2022-05-18 14:39:05,553 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@hdfs05-dev.yingzi.com:23065] has failed, address is now gated for [50] ms. Reason: [Disassociated]

2022-05-18 14:40:06,013 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job b5a0c61ab1b1495dd9c2f8e45d349957 with leader id 84a8fd94fa276bf476d3af016c1441be lost leadership.
2022-05-18 14:40:06,017 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Close JobManager connection for job b5a0c61ab1b1495dd9c2f8e45d349957.
2022-05-18 14:40:06,018 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Attempting to fail task externally Source: TableSourceScan(table=[[default_catalog, default_database, anc_herd_property_val]], fields=[id, create_user, modify_user, create_time, modify_time, app_id, tenant_id, deleted, animal_id, property_id, property_val]) -> MiniBatchAssigner(interval=[2000ms], mode=[ProcTime]) -> DropUpdateBefore (1/1)#0 (9c5ac532a5f8459790af009cbc74fd3a).
2022-05-18 14:40:06,020 WARN  org.apache.flink.runtime.taskmanager.Task                    [] - Source: TableSourceScan(table=[[default_catalog, default_database, anc_herd_property_val]], fields=[id, create_user, modify_user, create_time, modify_time, app_id, tenant_id, deleted, animal_id, property_id, property_val]) -> MiniBatchAssigner(interval=[2000ms], mode=[ProcTime]) -> DropUpdateBefore (1/1)#0 (9c5ac532a5f8459790af009cbc74fd3a) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for b5a0c61ab1b1495dd9c2f8e45d349957.
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1660)
...
Caused by: java.lang.Exception: Job leader for job id b5a0c61ab1b1495dd9c2f8e45d349957 lost leadership.
...

2022-05-18 14:40:16,042 INFO  org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:0, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=837.120mb (877783849 bytes), taskOffHeapMemory=0 bytes, managedMemory=2.010gb (2158221148 bytes), networkMemory=343.040mb (359703515 bytes)}, allocationId: 85159a1bbf11b3993c17d8adcbc9694c, jobId: b5a0c61ab1b1495dd9c2f8e45d349957).
2022-05-18 14:40:16,050 INFO  org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Remove job b5a0c61ab1b1495dd9c2f8e45d349957 from job leader monitoring.
2022-05-18 14:40:16,050 INFO  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2022-05-18 14:40:16,050 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/b5a0c61ab1b1495dd9c2f8e45d349957/job_manager_lock'}.
2022-05-18 14:40:16,109 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Cannot find task to fail for execution 373d61b162d19ffda23ea2906f0cb7a9 with exception:
java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.jobmaster.JobMasterGateway.updateTaskExecutionState(org.apache.flink.runtime.taskmanager.TaskExecutionState) timed out.
	at com.sun.proxy.$Proxy36.updateTaskExecutionState(Unknown Source) ~[?:?]
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.updateTaskExecutionState(TaskExecutor.java:1852) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.unregisterTaskAndNotifyFinalState(TaskExecutor.java:1885) ~[flink-dist_2.12-1.13.2.jar:1.13.2]

...

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@hdfs05-dev.yingzi.com:23065/user/rpc/jobmanager_2#-613206152]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.

...

2022-05-18 14:45:06,090 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Fatal error occurred in TaskExecutor akka.tcp://flink@hdfs04-dev.yingzi.com:32782/user/rpc/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1440) ~[flink-dist_2.12-1.13.2.jar:1.13.2]
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$17(TaskExecutor.java:1425) ~[flink-dist_2.12-1.13.2.jar:1.13.2]

...

2022-05-18 14:45:06,106 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Stopping TaskExecutor akka.tcp://flink@hdfs04-dev.yingzi.com:32782/user/rpc/taskmanager_0.
2022-05-18 14:45:06,110 INFO  org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Stop job leader service.
2022-05-18 14:45:06,110 INFO  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Stopping DefaultLeaderRetrievalService.
2022-05-18 14:45:06,110 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/resource_manager_lock'}.
2022-05-18 14:45:06,110 INFO  org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
2022-05-18 14:45:06,114 INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl      [] - FileChannelManager removed spill file directory /hadoop/yarn/local/usercache/hdfs/appcache/application_1652757164854_0006/flink-io-2f303ad9-eef4-4048-846e-85dc29da5038
2022-05-18 14:45:06,114 INFO  org.apache.flink.runtime.io.network.NettyShuffleEnvironment  [] - Shutting down the network environment and its components.
2022-05-18 14:45:06,114 INFO  org.apache.flink.runtime.io.network.netty.NettyClient        [] - Successful shutdown (took 0 ms).
2022-05-18 14:45:06,117 INFO  org.apache.flink.runtime.io.network.netty.NettyServer        [] - Successful shutdown (took 2 ms).
2022-05-18 14:45:06,125 INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl      [] - FileChannelManager removed spill file directory /hadoop/yarn/local/usercache/hdfs/appcache/application_1652757164854_0006/flink-netty-shuffle-5b4f703f-8827-4139-8237-6b9bfdfefe6e
2022-05-18 14:45:06,125 INFO  org.apache.flink.runtime.taskexecutor.KvStateService         [] - Shutting down the kvState service and its components.
2022-05-18 14:45:06,125 INFO  org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService [] - Stop job leader service.
2022-05-18 14:45:06,129 INFO  org.apache.flink.runtime.filecache.FileCache                 [] - removed file cache directory /hadoop/yarn/local/usercache/hdfs/appcache/application_1652757164854_0006/flink-dist-cache-f2a2be98-66ed-45c2-815b-a981e7a90d05

Only then does the job return to normal.
We can see that the heartbeat timeout kicked in, which triggered the failover and restored the job from a checkpoint.
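On the Flink side the relevant knobs are heartbeat.interval and heartbeat.timeout in flink-conf.yaml. A minimal sketch with the Flink 1.13 defaults, in milliseconds (not necessarily the values in effect in this drill):

# flink-conf.yaml: JobManager <-> TaskManager heartbeat settings (Flink 1.13 defaults)
heartbeat.interval: 10000   # how often heartbeat requests are sent
heartbeat.timeout: 50000    # how long to wait before the peer is considered dead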

(5) Kill the NodeManager process on the TaskManager's machine

Right after the kill, the TaskManager process is fine and the job keeps working normally.
After ten-odd minutes, the TaskManager is automatically moved to a new node and restarted.
The behavior is essentially the same as in (2) above.

(6) Kill the NodeManager process on the TaskManager's machine first, then kill the TaskManager process

The Flink UI still shows the TaskManager as healthy and on its original node, but the job is in fact no longer working and checkpoints fail.
Apart from the TaskManager not being moved immediately, the behavior is basically the same as in (4) above.

Conclusions & open questions

With HA enabled, the job can recover automatically even when a service process, or an entire machine, goes down. The actual recovery time depends on which service died and on a handful of timeout settings.

If a machine goes down outright, recovery can take ten-odd minutes. It can be shortened by tuning the parameters, but the side effects of shorter timeouts need further study. The timeout-related settings covered in these drills are summarized below.
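For reference, a checklist of those settings and their documented defaults (YARN properties live in yarn-site.xml, Flink ones in flink-conf.yaml); the defaults are from the YARN and Flink 1.13 documentation, not values measured in these drills:

In yarn-site.xml:
yarn.resourcemanager.am.max-attempts: default 2, the upper bound for yarn.application-attempts
yarn.nm.liveness-monitor.expiry-interval-ms: default 600000 ms (10 minutes), after which the NodeManager is marked LOST
yarn.am.liveness-monitor.expiry-interval-ms: default 600000 ms (10 minutes), after which the ApplicationMaster is considered expired

In flink-conf.yaml:
yarn.application-attempts: defaults to 2 when HA is enabled
heartbeat.interval: default 10000 ms
heartbeat.timeout: default 50000 ms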