Flink on YARN Checkpoint Restart: Flink's Checkpoint Mechanism and Recovery

Reposted

flybirdfly 2023-12-21 12:41:28


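This article describes the fault-tolerance mechanisms that Flink provides, explains the logic of the Chandy-Lamport distributed snapshot algorithm, and walks through how Flink implements checkpoints.

1 Fault Tolerance

Flink's fault tolerance is mainly about saving and restoring state. It involves state backends, checkpoints and savepoints, and error recovery for jobs and tasks.

1.1 State Backends

A state backend is the container that stores checkpoint data. The available backends are MemoryStateBackend, FsStateBackend, and RocksDBStateBackend. State itself comes in two kinds: operator state and keyed state.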


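① MemoryStateBackend: the default; intended for local debugging and small state.
② FsStateBackend: for high-availability deployments; suits large state and long windows.
③ RocksDBStateBackend: for high-availability deployments; supports incremental checkpoints and suits very large state and long windows.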

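1.2 Saving and Restoring State

Flink saves and restores state mainly through the checkpoint and savepoint mechanisms.

| | Checkpoint | Savepoint |
| ------ | ------ | ------ |
| Save | Distributed snapshots taken on a timer | Triggered manually by the user to back up state and stop the job |
| Restore | | |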

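1.2.1 Concepts

(1) Snapshot

The term snapshot comes from photography, where it means a photo that is developed quickly. In computing, a snapshot is a record of the state of stored data at a particular moment. A Flink snapshot is a globally consistent record of a job's state. A complete snapshot includes the state of the source operators (for example, the offsets of the Kafka partitions being consumed), the buffered data of stateful operators, and the state of the sink operators (batched buffered data, transaction data, and so on).

(2) Checkpoint

A checkpoint produces snapshots automatically and is used for Flink failure recovery. Checkpoints are distributed, asynchronous, and incremental.

(3) Savepoint

A savepoint is triggered manually by the user and stores the full state of the job. Typical uses are upgrading a job, rescaling its parallelism, and migrating to another cluster.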

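1.2.2 The Snapshot Mechanism

Flink uses lightweight, distributed, asynchronous snapshots. The implementation uses barriers as the checkpoint signal. Barriers flow downstream together with the business records, exactly as ordinary records do, and divide the data stream into micro-batches, each of which is saved as a snapshot by a checkpoint. When a barrier passes through a node of the stream graph, Flink checkpoints that node's state. As the figure below shows, checkpoint n contains the state of every operator, and that state reflects all events before checkpoint n and none of the events after it.

[Figure 03 (image not available)]

The figure below shows checkpoint barrier alignment. When a subtask (an operator instance in the ExecutionGraph, the physical execution graph) receives a barrier, it records its state. If a subtask has two upstream inputs (for example, KeyedProcessFunction or CoProcessFunction), it triggers the checkpoint only after it has received the barrier from both inputs; this is barrier alignment.

[Figure 04 (image not available)]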
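The alignment rule can be illustrated with a small, self-contained simulation. This is a toy model, not Flink code: the class `BarrierAlignmentSketch`, its `onElement` method, and the string-encoded records are all made up for illustration. It assumes a subtask with exactly two input channels whose state is a running sum.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of barrier alignment on one subtask with two input channels (not Flink code). */
final class BarrierAlignmentSketch {
    static final String BARRIER = "BARRIER";

    long sum = 0;                                        // operator state: running sum of records
    final List<Long> snapshots = new ArrayList<>();      // state captured at each aligned barrier
    private final boolean[] barrierSeen = new boolean[2];
    private final List<Deque<String>> held = new ArrayList<>();

    BarrierAlignmentSketch() {
        held.add(new ArrayDeque<>());
        held.add(new ArrayDeque<>());
    }

    /** Handles one element (a numeric record or BARRIER) arriving on channel 0 or 1. */
    void onElement(int channel, String element) {
        if (barrierSeen[channel]) {
            held.get(channel).addLast(element);          // channel is blocked until alignment completes
            return;
        }
        if (!BARRIER.equals(element)) {
            sum += Long.parseLong(element);              // pre-barrier record: process normally
            return;
        }
        barrierSeen[channel] = true;
        if (barrierSeen[0] && barrierSeen[1]) {
            snapshots.add(sum);                          // aligned: state covers only pre-barrier records
            barrierSeen[0] = false;
            barrierSeen[1] = false;
            for (int c = 0; c < 2; c++) {                // replay what was held back during alignment
                List<String> replay = new ArrayList<>(held.get(c));
                held.get(c).clear();
                for (String e : replay) {
                    onElement(c, e);
                }
            }
        }
    }

    /** Runs one fixed interleaving and returns {snapshot, final sum}. */
    static long[] demo() {
        BarrierAlignmentSketch task = new BarrierAlignmentSketch();
        task.onElement(0, "1");
        task.onElement(0, "2");
        task.onElement(0, BARRIER);   // channel 0 is now blocked
        task.onElement(0, "10");      // held back: belongs to the next checkpoint
        task.onElement(1, "3");       // channel 1 still flows
        task.onElement(1, BARRIER);   // alignment completes here
        task.onElement(1, "20");
        return new long[] {task.snapshots.get(0), task.sum};
    }
}
```

In `demo()`, the snapshot holds 6 (records 1, 2, and 3) even though record 10 arrived before the second barrier, because channel 0 was blocked while waiting. The final sum is 36, so nothing is lost by holding records back.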

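With copy-on-write, the old state can be snapshotted asynchronously without blocking real-time processing of business data. The old state is garbage-collected only after the snapshot has been persisted.

1.2.3 Guaranteeing Exactly-Once Semantics

A failure in a user job can cause results to be lost or duplicated. To address this, Flink checkpointing offers two guarantees:
① At-Least-Once: no data is lost, but results may be duplicated.
② Exactly-Once: checkpoint barrier alignment guarantees that each record is reflected in the state exactly once.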

// At least once
CheckpointingMode.AT_LEAST_ONCE

// Exactly once
CheckpointingMode.EXACTLY_ONCE

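Exactly-once here means exactly-once inside Flink, not end-to-end. End-to-end exactly-once also requires the clients of the external systems to support rollback and transactions: the source must be able to rewind and the sink must be transactional (for example, the Kafka connector provides both; this will be covered in a later update).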
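To see why a rewindable source matters, and why the source offset and the operator state have to be restored together, consider the following toy recovery loop. It is a sketch under simplifying assumptions, not Flink code: `ReplayFromCheckpointSketch` and its parameters are hypothetical, there is a single source and a single summing operator, and exactly one failure is injected.

```java
/** Toy model of recovery: a source with a replayable offset feeding a summing operator. */
final class ReplayFromCheckpointSketch {

    /**
     * Sums all records, checkpointing every `interval` records and failing once
     * just before the record at index `failAt` is processed.
     */
    static long run(int[] records, int interval, int failAt, boolean restoreState) {
        int checkpointedOffset = 0;   // source state in the last completed checkpoint
        long checkpointedSum = 0;     // operator state in the last completed checkpoint
        int offset = 0;
        long sum = 0;
        boolean failed = false;
        while (offset < records.length) {
            if (!failed && offset == failAt) {
                failed = true;
                offset = checkpointedOffset;          // the source rewinds to the checkpointed offset
                if (restoreState) {
                    sum = checkpointedSum;            // the operator state is rolled back with it
                }
                continue;
            }
            sum += records[offset];
            offset++;
            if (offset % interval == 0) {             // checkpoint: capture offset and sum together
                checkpointedOffset = offset;
                checkpointedSum = sum;
            }
        }
        return sum;
    }
}
```

With records 1 through 7, a checkpoint every 3 records, and a failure before index 5, `run(..., true)` returns 28, the same as a failure-free run. `run(..., false)` returns 37, because records 4 and 5 are replayed on top of state that already includes them.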

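1.2.4 Error Recovery Strategies for Jobs and Tasks

(1) Job Restart Strategies

① FailureRateRestartStrategy: allows up to a maximum number of failures within a given time interval; a restart delay can also be set.
② FixedDelayRestartStrategy: allows a fixed number of restart attempts; a restart delay can also be set.
③ NoRestartStrategy: no restart; the job fails directly.
④ ThrowingRestartStrategy: no restart; an exception is thrown directly.

A job restart strategy can be set through env.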

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
  3, // number of restart attempts
  Time.of(10, TimeUnit.SECONDS) // delay
));

All of the strategies above implement the RestartStrategy interface, whose key method is restart (which performs the restart).

/**
 * Strategy for {@link ExecutionGraph} restarts.
 */
public interface RestartStrategy {

    /**
     * True if the restart strategy can be applied to restart the {@link ExecutionGraph}.
     *
     * @return true if restart is possible, otherwise false
     */
    boolean canRestart();

    /**
     * Called by the ExecutionGraph to eventually trigger a full recovery.
     * The recovery must be triggered on the given callback object, and may be delayed
     * with the help of the given scheduled executor.
     *
     * <p>The thread that calls this method is not supposed to block/sleep.
     *
     * @param restarter The hook to restart the ExecutionGraph
     * @param executor An scheduled executor to delay the restart
     * @return A {@link CompletableFuture} that will be completed when the restarting process is done.
     */
    CompletableFuture<Void> restart(RestartCallback restarter, ScheduledExecutor executor);
}
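The decision that `canRestart()` encodes can be sketched without any Flink dependency. The classes below are hypothetical stand-ins, not Flink's implementations: `FixedAttempts` mirrors the idea behind the fixed-delay strategy and `FailureRate` the idea behind the failure-rate strategy. The restart delay is left out to keep the example short, and time is passed in explicitly so the behavior is deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy restart policies modeled on the fixed-attempts and failure-rate ideas (not Flink classes). */
final class RestartPolicySketch {

    interface Policy {
        /** Registers a failure at the given time and returns true if a restart is still allowed. */
        boolean onFailure(long nowMillis);
    }

    /** Allows at most maxAttempts restarts over the lifetime of the job. */
    static final class FixedAttempts implements Policy {
        private final int maxAttempts;
        private int attempts;

        FixedAttempts(int maxAttempts) {
            this.maxAttempts = maxAttempts;
        }

        @Override
        public boolean onFailure(long nowMillis) {
            attempts++;
            return attempts <= maxAttempts;
        }
    }

    /** Allows at most maxFailures failures within any window of intervalMillis. */
    static final class FailureRate implements Policy {
        private final int maxFailures;
        private final long intervalMillis;
        private final Deque<Long> recent = new ArrayDeque<>();

        FailureRate(int maxFailures, long intervalMillis) {
            this.maxFailures = maxFailures;
            this.intervalMillis = intervalMillis;
        }

        @Override
        public boolean onFailure(long nowMillis) {
            recent.addLast(nowMillis);
            // Drop failures that have fallen out of the window; the newest entry always stays.
            while (nowMillis - recent.peekFirst() > intervalMillis) {
                recent.removeFirst();
            }
            return recent.size() <= maxFailures;
        }
    }
}
```

The difference shows up on long-running jobs: `FixedAttempts` eventually gives up no matter how far apart the failures are, while `FailureRate` forgives old failures once they leave the window.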

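(2) Task Failover Strategies

① RestartAllStrategy: restarts all tasks; this is the default.
② RestartIndividualStrategy: recovers a single task. If that task has no source, data may be lost.
③ NoOpFailoverStrategy: does not recover the task.

All of the strategies above extend the FailoverStrategy base class, whose key methods are Factory.create (creates the strategy) and onTaskFailure (handles the failure).

Notes: ① The task failover strategy is selected with the configuration option jobmanager.execution.failover-strategy. ② The ExecutionGraph, the job's physical execution graph, is the unit that a job restart acts on; an Execution, that is, a subtask, is the unit that a task restart acts on.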

/**
 * A {@code FailoverStrategy} describes how the job computation recovers from task
 * failures.
 * 
 * <p>Failover strategies implement recovery logic for failures of tasks. The execution
 * graph still implements "global failure / recovery" (which restarts all tasks) as
 * a fallback plan or safety net in cases where it deems that the state of the graph
 * may have become inconsistent.
 */
public abstract class FailoverStrategy {


    // ------------------------------------------------------------------------
    //  failover implementation
    // ------------------------------------------------------------------------ 

    /**
     * Called by the execution graph when a task failure occurs.
     * 
     * @param taskExecution The execution attempt of the failed task. 
     * @param cause The exception that caused the task failure.
     */
    public abstract void onTaskFailure(Execution taskExecution, Throwable cause);

    /**
     * Called whenever new vertices are added to the ExecutionGraph.
     * 
     * @param newJobVerticesTopological The newly added vertices, in topological order.
     */
    public abstract void notifyNewVertices(List<ExecutionJobVertex> newJobVerticesTopological);

    /**
     * Gets the name of the failover strategy, for logging purposes.
     */
    public abstract String getStrategyName();

    /**
     * Tells the FailoverStrategy to register its metrics.
     * 
     * <p>The default implementation does nothing
     * 
     * @param metricGroup The metric group to register the metrics at
     */
    public void registerMetrics(MetricGroup metricGroup) {}

    // ------------------------------------------------------------------------
    //  factory
    // ------------------------------------------------------------------------

    /**
     * This factory is a necessary indirection when creating the FailoverStrategy to that
     * we can have both the FailoverStrategy final in the ExecutionGraph, and the
     * ExecutionGraph final in the FailOverStrategy.
     */
    public interface Factory {

        /**
         * Instantiates the {@code FailoverStrategy}.
         * 
         * @param executionGraph The execution graph for which the strategy implements failover.
         * @return The instantiated failover strategy.
         */
        FailoverStrategy create(ExecutionGraph executionGraph);
    }
}

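2 The Chandy-Lamport Algorithm in Detail

2.1 Background

Producing a reliable, globally consistent snapshot is a hard problem in distributed systems. The traditional approach relies on a global clock, which suffers from reliability problems such as a single point of failure and inconsistent data. To solve this, the Chandy-Lamport algorithm replaces the global clock with the propagation of markers.

Global snapshot: a global snapshot is the global state of the system, and it is used for failure recovery.

2.2 The Chandy-Lamport Algorithm

The distributed system is simplified to a directed graph in which nodes are processes and edges are channels.

(1) Initiating the snapshot

① Process Pi records its own state, creates a marker (a special message distinct from normal messages), and sends it over its output channels to the other processes in the system.
② Process Pi starts recording the messages received on all of its input channels.

(2) Propagating the snapshot

Process Pj receives a marker on input channel Ckj. If Pj has not yet recorded its state, it records its state and sends the marker on its output channels. Otherwise Pj is already recording, and the messages that arrived on Ckj before this marker are recorded as the state of that channel.

The marker acts as a separator that splits the unbounded stream into batches, and each batch is snapshotted. For example, suppose process Pj handles the messages [n6,n5,marker2,n4,marker1,n3,n2,n1], where n1 arrives first. After receiving marker1, the snapshot records n3, n2, and n1; after receiving marker2, the snapshot records n4.

(3) Completing the snapshot

The snapshot is complete once every process has received the marker and recorded both its own state and the state of its channels (the messages they contain).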
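The three phases above can be traced in a small simulation. The sketch below is illustrative only and makes several assumptions that are not in the original text: processes are fully connected by FIFO channels, each process's state is an integer balance, every normal message is a transfer of a positive amount, and messages are delivered in a fixed round-robin order. The class and method names are made up. Because transfers only move money around, a consistent snapshot must account for the full initial total.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy Chandy-Lamport snapshot over fully connected processes with FIFO channels. */
final class ChandyLamportSketch {
    static final int MARKER = -1;            // sentinel; normal messages are positive amounts

    final int n;
    final int[] balance;                     // live state of each process
    final Integer[] recordedState;           // snapshot of each process, null until recorded
    final int[][] recordedInFlight;          // [from][to]: channel state captured in the snapshot
    final boolean[][] recording;             // [from][to]: receiver is still recording this channel
    final List<List<Deque<Integer>>> channel = new ArrayList<>();

    ChandyLamportSketch(int n, int initialBalance) {
        this.n = n;
        balance = new int[n];
        recordedState = new Integer[n];
        recordedInFlight = new int[n][n];
        recording = new boolean[n][n];
        for (int i = 0; i < n; i++) {
            balance[i] = initialBalance;
            List<Deque<Integer>> row = new ArrayList<>();
            for (int j = 0; j < n; j++) {
                row.add(new ArrayDeque<>());
            }
            channel.add(row);
        }
    }

    void send(int from, int to, int amount) {
        balance[from] -= amount;
        channel.get(from).get(to).addLast(amount);
    }

    /** Records p's state, sends a marker on every output channel, starts recording every input channel. */
    void recordLocalState(int p) {
        recordedState[p] = balance[p];
        for (int q = 0; q < n; q++) {
            if (q == p) {
                continue;
            }
            channel.get(p).get(q).addLast(MARKER);
            recording[q][p] = true;
        }
    }

    /** Delivers the oldest message on channel from->to, if there is one. */
    void deliver(int from, int to) {
        Integer msg = channel.get(from).get(to).pollFirst();
        if (msg == null) {
            return;
        }
        if (msg == MARKER) {
            if (recordedState[to] == null) {
                recordLocalState(to);        // first marker seen: record own state, forward markers
            }
            recording[from][to] = false;     // everything sent before the marker has now been seen
        } else {
            balance[to] += msg;
            if (recording[from][to]) {
                recordedInFlight[from][to] += msg;   // in flight when the receiver recorded its state
            }
        }
    }

    /** Delivers messages round-robin until every channel is empty. */
    void drain() {
        boolean delivered = true;
        while (delivered) {
            delivered = false;
            for (int from = 0; from < n; from++) {
                for (int to = 0; to < n; to++) {
                    if (!channel.get(from).get(to).isEmpty()) {
                        deliver(from, to);
                        delivered = true;
                    }
                }
            }
        }
    }

    /** Sum of all recorded process states and recorded channel states. */
    int snapshotTotal() {
        int total = 0;
        for (int p = 0; p < n; p++) {
            total += recordedState[p];
            for (int q = 0; q < n; q++) {
                total += recordedInFlight[q][p];
            }
        }
        return total;
    }

    static int demo() {
        ChandyLamportSketch s = new ChandyLamportSketch(3, 100);
        s.send(0, 1, 10);
        s.recordLocalState(0);               // P0 initiates the snapshot
        s.send(1, 0, 20);                    // sent before P1 has seen a marker
        s.send(2, 1, 5);
        s.drain();
        return s.snapshotTotal();
    }
}
```

In `demo()`, the transfer of 20 from P1 to P0 is sent after P0 has recorded its state but before P1 has recorded its own, so it appears in neither process state. It is captured as the state of channel P1->P0 instead, which is why the snapshot total still comes to 300.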

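2.3 Summary

Flink's distributed asynchronous snapshots are based on the Chandy-Lamport algorithm. The core idea is to inject barriers at the sources in place of Chandy-Lamport markers, and to control barrier alignment in order to produce snapshots and provide exactly-once semantics.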

3 The Checkpoint Process

Step 1: The Checkpoint Coordinator triggers a checkpoint

The Checkpoint Coordinator triggers a checkpoint on all source nodes.


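Step 2: Sources broadcast barriers downstream

Each source task broadcasts a barrier downstream.

[Figure 06 (image not available)]

Every source task emits a barrier for the same checkpoint and broadcasts it downstream. In the figure above, for example, source task1 and source task2 both emit barrier n and broadcast it downstream.

Step 3: Sources notify the coordinator that their backup is complete

After a source task has backed up its state, it reports the location of the backup data (the state handle) to the Checkpoint Coordinator.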
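The coordinator-side bookkeeping for this step can be pictured as a set of tasks that have not yet reported. The sketch below is a simplified stand-in, not Flink's implementation: the class `PendingCheckpointSketch`, the task names, and the idea of representing a state handle as a plain string are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy view of a pending checkpoint: it completes once every task has reported a state handle. */
final class PendingCheckpointSketch {
    final long checkpointId;
    private final Set<String> notYetAcknowledged;
    private final Map<String, String> stateHandles = new HashMap<>();

    PendingCheckpointSketch(long checkpointId, String... taskNames) {
        this.checkpointId = checkpointId;
        this.notYetAcknowledged = new HashSet<>(Arrays.asList(taskNames));
    }

    /** Stores the task's state handle; returns true once all tasks have acknowledged. */
    boolean acknowledge(String taskName, String stateHandle) {
        if (notYetAcknowledged.remove(taskName)) {
            stateHandles.put(taskName, stateHandle);   // duplicate or unknown acks are ignored
        }
        return notYetAcknowledged.isEmpty();
    }

    String handleOf(String taskName) {
        return stateHandles.get(taskName);
    }
}
```

A usage example with made-up task names: after `new PendingCheckpointSketch(7L, "source-1", "source-2", "sink-1")`, the first two calls to `acknowledge` return false and only the third, from the last outstanding task, returns true.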


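A task backs up its state in two phases:

1. Synchronous phase: the task takes a snapshot of its state so that it can be written to external storage. The snapshot is taken as follows:
   a. Deep-copy the state.
   b. Wrap the write in an asynchronous FutureTask. The FutureTask opens the output stream, writes the state metadata, writes the state, and closes the output stream.
2. Asynchronous phase: run the FutureTask created in the synchronous phase, then send an ACK to the Checkpoint Coordinator.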
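The split between the two phases can be sketched with the JDK's own `FutureTask`. This is a minimal illustration, not Flink's snapshot code: `AsyncSnapshotSketch` is a made-up class, the state is a plain `HashMap`, and the actual stream writes are replaced by simply returning the copied map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;

/** Toy two-phase snapshot: copy state synchronously, persist it asynchronously. */
final class AsyncSnapshotSketch {
    private final Map<String, Long> state = new HashMap<>();   // live operator state

    void update(String key, long delta) {
        state.merge(key, delta, Long::sum);
    }

    long current(String key) {
        return state.get(key);
    }

    /** Synchronous phase: copy the state and wrap the write in a FutureTask that has not run yet. */
    FutureTask<Map<String, Long>> snapshotSync() {
        // Long values are immutable, so copying the map yields an independent copy.
        Map<String, Long> copy = new HashMap<>(state);
        Callable<Map<String, Long>> write = () -> {
            // Stand-in for: open output stream, write metadata, write state, close stream.
            return copy;
        };
        return new FutureTask<>(write);
    }

    /** Returns {value in the persisted snapshot, live value after further processing}. */
    static long[] demo() {
        AsyncSnapshotSketch task = new AsyncSnapshotSketch();
        task.update("a", 1);
        FutureTask<Map<String, Long>> write = task.snapshotSync();
        task.update("a", 41);                 // processing continues before the write has run
        ExecutorService io = Executors.newSingleThreadExecutor();
        try {
            io.execute(write);                // asynchronous phase
            Map<String, Long> persisted = write.get();
            // This is the point at which the task would send its ACK to the coordinator.
            return new long[] {persisted.get("a"), task.current("a")};
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            io.shutdown();
        }
    }
}
```

`demo()` returns 1 for the persisted value and 42 for the live value: the update made after the synchronous phase is not visible in the snapshot, and the snapshot did not hold up processing.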