I. Usage

To see how the watchdog works, we start from how it is launched and how system services register with it.

Starting the watchdog

private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
    t.traceBegin("startBootstrapServices");

    // Start the watchdog as early as possible so we can crash the system server
    // if we deadlock during early boot
    t.traceBegin("StartWatchdog");
    final Watchdog watchdog = Watchdog.getInstance();
    watchdog.start();
    t.traceEnd();

Registering for watchdog monitoring

Taking the registration in AMS (ActivityManagerService) as an example:

Watchdog.getInstance().addMonitor(this);
Watchdog.getInstance().addThread(mHandler);

II. Mechanism analysis

Registration

There are two ways to register with the watchdog: addMonitor and addThread.

addMonitor:

private final HandlerChecker mMonitorChecker;

public interface Monitor {
    void monitor();
}

public void addMonitor(Monitor monitor) {
    synchronized (this) {
        mMonitorChecker.addMonitorLocked(monitor);
    }
}
public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
    private boolean mCompleted;
    private Monitor mCurrentMonitor;
    private long mStartTime;
    private int mPauseCount;

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler;
        mName = name;
        mWaitMax = waitMaxMillis;
        mCompleted = true;
    }

    void addMonitorLocked(Monitor monitor) {
        // We don't want to update mMonitors when the Handler is in the middle of checking
        // all monitors. We will update mMonitors on the next schedule if it is safe
        mMonitorQueue.add(monitor);
    }

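addMonitor places the monitor into mMonitorQueue, a list inside mMonitorChecker that holds newly registered monitors. As the comment says, the queue is merged into mMonitors only at the next scheduled check, when no check is in progress.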
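The deferred registration can be illustrated with a minimal sketch in plain Java. MonitorList and its method names are hypothetical and are not the AOSP class; the sketch only mirrors the two-list idea, where new monitors wait in a queue and are merged into the active list only when no check round is in flight.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the two-list bookkeeping inside HandlerChecker.
class MonitorList {
    interface Monitor { void monitor(); }

    private final List<Monitor> active = new ArrayList<>();   // plays the role of mMonitors
    private final List<Monitor> pending = new ArrayList<>();  // plays the role of mMonitorQueue
    private boolean completed = true;                         // plays the role of mCompleted

    // Registration never touches the active list directly.
    synchronized void add(Monitor m) {
        pending.add(m);
    }

    // Start of a check round: merge the queue only if the previous round finished.
    synchronized void beginRound() {
        if (completed) {
            active.addAll(pending);
            pending.clear();
        }
        completed = false;
    }

    synchronized void endRound() { completed = true; }

    synchronized int activeCount() { return active.size(); }

    synchronized int pendingCount() { return pending.size(); }
}

class MonitorListDemo {
    public static void main(String[] args) {
        MonitorList list = new MonitorList();
        list.add(() -> { });
        list.beginRound();   // previous round complete: the monitor becomes active
        list.add(() -> { }); // registered while a round is in flight
        list.beginRound();   // round still in flight: the new monitor stays pending
        System.out.println(list.activeCount() + " active, " + list.pendingCount() + " pending");
        list.endRound();
        list.beginRound();   // now the pending monitor is merged
        System.out.println(list.activeCount() + " active, " + list.pendingCount() + " pending");
    }
}
```

The point of the queue is that the list being iterated by the check is never modified while a check may still be running on another thread.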

addThread:

public void addThread(Handler thread) {
    addThread(thread, DEFAULT_TIMEOUT);
}

public void addThread(Handler thread, long timeoutMillis) {
    synchronized (this) {
        final String name = thread.getLooper().getThread().getName();
        mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
    }
}

addThread creates a HandlerChecker that holds the registered Handler and adds it to mHandlerCheckers, the Watchdog-wide container for HandlerCheckers. This checker watches the Handler's message queue.

Now look at the Watchdog constructor:

private Watchdog() {
    super("watchdog");
    // Initialize handler checkers for each common thread we want to check.  Note
    // that we are not currently checking the background thread, since it can
    // potentially hold longer running operations with no guarantees about the timeliness
    // of operations there.

    // The shared foreground thread is the main checker.  It is where we
    // will also dispatch monitor checks and do other work.
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread.  We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Add checker for shared UI thread.
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // And also check IO thread.
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // And the display thread.
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));
    // And the animation thread.
    mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
            "animation thread", DEFAULT_TIMEOUT));
    // And the surface animation thread.
    mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
            "surface animation thread", DEFAULT_TIMEOUT));

    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());

    mInterestingJavaPids.add(Process.myPid());

    // See the notes on DEFAULT_TIMEOUT.
    assert DB ||
            DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}

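As shown, mMonitorChecker, which carries the monitors, is itself a HandlerChecker (bound to the FgThread handler) and is stored in mHandlerCheckers, just like the HandlerCheckers created through addThread. The constructor also adds HandlerCheckers for the main thread, UiThread, IoThread, DisplayThread, AnimationThread, and SurfaceAnimationThread.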

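To summarize:

  1. Watchdog provides HandlerChecker, a checker that can track a Handler's message queue and, in addition, a set of custom Monitors (note: Monitors are tracked by one dedicated checker, mMonitorChecker).
  2. Watchdog provides the Monitor interface; any object that implements it can be registered with the watchdog and monitored.
  3. The Watchdog constructor already registers several HandlerCheckers that watch the message queues of UiThread, IoThread, DisplayThread, and so on. Other threads are registered through addThread; each call creates a separate HandlerChecker for that thread.
  4. All of these HandlerCheckers are stored in the global container mHandlerCheckers.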
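A common way to implement monitor() is to acquire and release the lock that the service itself uses: if some thread holds that lock indefinitely, monitor() never returns and the checker eventually reports the hang. The sketch below is an illustration, not a quote from AOSP. Monitor is a local stand-in for Watchdog.Monitor, and FakeService and LockProbe are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Local stand-in for Watchdog.Monitor so the sketch compiles on its own.
interface Monitor { void monitor(); }

// Hypothetical service guarded by a single lock.
class FakeService implements Monitor {
    final Object lock = new Object();

    // The probe: returns only if the service lock can be acquired.
    @Override
    public void monitor() {
        synchronized (lock) { /* acquiring the lock is the whole check */ }
    }
}

class LockProbe {
    // Runs monitor() on a helper thread and reports whether it returned within timeoutMs.
    static boolean returnsWithin(Monitor m, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> { m.monitor(); done.countDown(); });
        t.setDaemon(true);
        t.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        FakeService service = new FakeService();
        System.out.println("lock free, probe returned = " + returnsWithin(service, 1000));
        synchronized (service.lock) { // simulate a thread stuck while holding the lock
            System.out.println("lock held, probe returned = " + returnsWithin(service, 200));
        }
    }
}
```

Because the probe does no work beyond taking the lock, it is cheap when the service is healthy and blocks exactly when the service's lock is stuck.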

Running

public class Watchdog extends Thread {

So Watchdog is simply a Thread, and watchdog.start() just starts that thread.

private static final long DEFAULT_TIMEOUT = DB ? 10 * 1000 : 60 * 1000;
private static final long CHECK_INTERVAL = DEFAULT_TIMEOUT / 2;

@Override
public void run() {
    boolean waitedHalf = false;
    while (true) {
        final List<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        synchronized (this) {
            long timeout = CHECK_INTERVAL; // 30s
            // 1 Iterate over every checker in mHandlerCheckers and call its scheduleCheckLocked
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }

            long start = SystemClock.uptimeMillis();
            // 2 Wait for the full CHECK_INTERVAL (30s)
            while (timeout > 0) {
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
            // 3
            // Iterate over all checkers and take the worst state among them
            final int waitState = evaluateCheckerCompletionLocked();
            if (waitState == COMPLETED) {
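                // COMPLETED means none of the targets watched by the checkers is blocked.
                // Reset the state (waitedHalf marks whether a target has already been
                // blocked for half a minute) and continue with the next iteration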
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) {
                // Some target is currently blocked, but for less than half the timeout
                continue;
            } else if (waitState == WAITED_HALF) {
                // Some target has already been blocked for at least half a minute
                if (!waitedHalf) {
                    Slog.i(TAG, "WAITED_HALF");
                    ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
                    ActivityManagerService.dumpStackTraces(pids, null, null,
                            getInterestingNativePids(), null);
                    waitedHalf = true;
                }
                continue;
            }
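            // None of the three branches above was taken, so at least one target has been
            // blocked for the full timeout (it stayed blocked across two consecutive
            // half-minute intervals). Something is wrong and must be handled.

            // Collect the culprits: the checkers whose targets are severely blocked
            blockedCheckers = getBlockedCheckersLocked();
            // Build a description of them (names, thread names, and so on)
            subject = describeCheckersLocked(blockedCheckers);
            // Record whether a restart is allowed
            allowRestart = mAllowRestart;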
        }

        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

        ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
        // ************* Start collecting information and writing logs *************
        long anrTime = SystemClock.uptimeMillis();
        StringBuilder report = new StringBuilder();
        report.append(MemoryPressureUtil.currentPsiState());
        ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
        StringWriter tracesFileException = new StringWriter();
        final File stack = ActivityManagerService.dumpStackTraces(
                pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
                tracesFileException);

        SystemClock.sleep(5000);

        processCpuTracker.update();
        report.append(processCpuTracker.printCurrentState(anrTime));
        report.append(tracesFileException.getBuffer());

        doSysRq('w');
        doSysRq('l');

        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                public void run() {
                    if (mActivity != null) {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null, null,
                                subject, report.toString(), stack, null);
                    }
                    FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED,
                            subject);
                }
            };
        dropboxThread.start();
        try {
            dropboxThread.join(2000);
        } catch (InterruptedException ignored) {}
        // ****************************************************

        IActivityController controller;
        synchronized (this) {
            controller = mController;
        }
        if (controller != null) {
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }
        // Kill the system process (unless restart is not allowed)
        if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
            Slog.w(TAG, "*** GOODBYE!");
            Process.killProcess(Process.myPid());
            System.exit(10);
        }

        waitedHalf = false;
    }
}

(To make this easier to follow, the debugger-related code has been removed here; it is covered separately below.)

First, note that the watchdog's blocking limit is 60s (DEFAULT_TIMEOUT; it is 10s when DB is true).

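As the code shows, run() is an infinite loop. Each iteration has three steps:

  1. Schedule a check on every checker.
  2. Wait 30s.
  3. Examine the state of every checker's target and decide whether to dump stacks and restart; if not, move on to the next iteration.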
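Step 2 is written as a loop around wait(timeout) because wait can return before the timeout has elapsed (a notify, an interrupt, or a spurious wakeup), so the remaining time is recomputed after each wakeup. A minimal sketch of that pattern in plain Java follows. FullWait is a hypothetical name, and System.nanoTime() is used here in place of the Android-only SystemClock.uptimeMillis().

```java
class FullWait {
    // Blocks on `lock` until at least intervalMs has elapsed, tolerating early wakeups.
    // Returns the time actually waited, in milliseconds.
    static long waitFully(Object lock, long intervalMs) {
        final long start = System.nanoTime();
        synchronized (lock) {
            long remaining = intervalMs;
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Like the loop in run(), keep waiting for the remaining time.
                }
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                remaining = intervalMs - elapsedMs;
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        Object lock = new Object();
        // A second thread notifies early; the loop still waits out the full interval.
        Thread notifier = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
            }
            synchronized (lock) {
                lock.notifyAll();
            }
        });
        notifier.start();
        System.out.println("waited ms: " + waitFully(lock, 300));
    }
}
```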

Let us look at how steps 1, 2, and 3 are implemented.

1 Scheduling the checks

public void scheduleCheckLocked() {
    if (mCompleted) {
        mMonitors.addAll(mMonitorQueue);
        mMonitorQueue.clear();
    }
    if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
            || (mPauseCount > 0)) {
        mCompleted = true;
        return;
    }
    if (!mCompleted) {
        return;
    }
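    // Mark the check as not completed
    mCompleted = false;
    mCurrentMonitor = null;
    // Record when the check was started
    mStartTime = SystemClock.uptimeMillis();
    // Post this checker (a Runnable) to the front of the monitored thread's message
    // queue; when the message is handled, run() below is executed
    mHandler.postAtFrontOfQueue(this);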
}

@Override
public void run() {
    final int size = mMonitors.size();
    // Monitor check: iterate over all registered monitors and call monitor() on each
    for (int i = 0 ; i < size ; i++) {
        synchronized (Watchdog.this) {
            mCurrentMonitor = mMonitors.get(i);
        }
        mCurrentMonitor.monitor();
    }

    synchronized (Watchdog.this) {
        // Reaching this point means neither the message queue nor any monitor is blocked; mark the check completed
        mCompleted = true;
        mCurrentMonitor = null;
    }
}

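Scheduling a check simply means posting a check message through the registered Handler. If the message queue is not blocked, the message gets handled, which means the checker's own run() is executed. run() also calls every monitor; if all of them return, mCompleted is set back to true.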
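The heartbeat idea can be tried out without Android. In the sketch below a single-thread ExecutorService stands in for the Handler and its Looper thread. That substitution is an assumption made for illustration: unlike postAtFrontOfQueue, an executor appends to the tail of its queue. QueueProbe is a hypothetical class, not part of AOSP.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical probe: posts itself to a worker thread and flips a flag when it runs.
class QueueProbe implements Runnable {
    private final ExecutorService target;
    private boolean completed = true;

    QueueProbe(ExecutorService target) { this.target = target; }

    // Post a new probe only if the previous one has finished.
    synchronized void schedule() {
        if (!completed) return;
        completed = false;
        target.execute(this);
    }

    @Override
    public void run() {
        synchronized (this) { completed = true; }
    }

    synchronized boolean isCompleted() { return completed; }

    // Polls until the probe has run or timeoutMs has elapsed.
    boolean awaitCompleted(long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000;
        while (!isCompleted() && System.nanoTime() < deadline) {
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return isCompleted();
    }
}

class QueueProbeDemo {
    public static void main(String[] args) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        QueueProbe probe = new QueueProbe(worker);

        probe.schedule();
        System.out.println("idle worker, completed = " + probe.awaitCompleted(1000));

        CountDownLatch release = new CountDownLatch(1);
        worker.execute(() -> { // a task that hogs the worker thread
            try {
                release.await();
            } catch (InterruptedException ignored) {
            }
        });
        probe.schedule();
        System.out.println("busy worker, completed = " + probe.awaitCompleted(200));

        release.countDown();
        worker.shutdown();
    }
}
```

The watcher never inspects the worker directly; it only observes whether its own message got a turn, which is why the same mechanism works for any thread that has a message queue.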

2 Wait 30s

3 Collect the results of the checks scheduled in step 1

evaluateCheckerCompletionLocked

private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}
// HandlerChecker
public int getCompletionStateLocked() {
    if (mCompleted) {
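        // mCompleted == true: the check message posted in step 1 has already run
        // to completion, so return COMPLETED
        return COMPLETED;
    } else {
        // latency: time elapsed since the check message was posted in step 1
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) {
            // Less than half a minute: WAITING
            return WAITING;
        } else if (latency < mWaitMax) {
            // At least half a minute but less than 1 minute: WAITED_HALF
            return WAITED_HALF;
        }
        // Neither condition holds, so at least 1 minute has passed: OVERDUE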
    }
    return OVERDUE;
}

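mWaitMax is passed to the constructor when the checker is created. In the code shown here it is always DEFAULT_TIMEOUT (1 minute when DB is false), so this article uses 1 minute throughout.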

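A checker has four states:

COMPLETED: the check message posted in step 1 has run and set mCompleted to true; the check passed and nothing is blocked.

WAITING: the check message has not completed, and (time of evaluation - time the check was posted) < 30s.

WAITED_HALF: the check message has not completed, and 30s <= (time of evaluation - time the check was posted) < 60s.

OVERDUE: the check message has not completed, and (time of evaluation - time the check was posted) >= 60s.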
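The boundaries between the four states, and the way the worst state wins, can be checked with a small standalone sketch. CheckerStates is a hypothetical class. The numeric values 0 to 3 are an assumption chosen so that a larger number means a worse state, which is what the Math.max aggregation in evaluateCheckerCompletionLocked relies on.

```java
class CheckerStates {
    // Assumed values: larger means worse, so Math.max picks the worst state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // Mirrors the branching in getCompletionStateLocked.
    static int classify(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) return COMPLETED;
        if (latencyMs < waitMaxMs / 2) return WAITING;
        if (latencyMs < waitMaxMs) return WAITED_HALF;
        return OVERDUE;
    }

    // Mirrors evaluateCheckerCompletionLocked: the overall state is the worst one.
    static int worst(int... states) {
        int state = COMPLETED;
        for (int s : states) {
            state = Math.max(state, s);
        }
        return state;
    }

    public static void main(String[] args) {
        long waitMax = 60_000;
        System.out.println(classify(true, 90_000, waitMax));  // completed, latency is ignored
        System.out.println(classify(false, 29_999, waitMax)); // just under half the timeout
        System.out.println(classify(false, 30_000, waitMax)); // exactly half the timeout
        System.out.println(classify(false, 60_000, waitMax)); // exactly the timeout
        System.out.println(worst(COMPLETED, WAITED_HALF, WAITING));
    }
}
```

Note that a latency of exactly 30s already counts as WAITED_HALF and exactly 60s as OVERDUE, because both comparisons in the original code are strict.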

III. Other settings

Pausing monitoring

public void pauseWatchingCurrentThread(String reason) {
    synchronized (this) {
        for (HandlerChecker hc : mHandlerCheckers) {
            if (Thread.currentThread().equals(hc.getThread())) {
                hc.pauseLocked(reason);
            }
        }
    }
}

public void resumeWatchingCurrentThread(String reason) {
    synchronized (this) {
        for (HandlerChecker hc : mHandlerCheckers) {
            if (Thread.currentThread().equals(hc.getThread())) {
                hc.resumeLocked(reason);
            }
        }
    }
}
// HandlerChecker
public void pauseLocked(String reason) {
    mPauseCount++;
    mCompleted = true;
    Slog.i(TAG, "Pausing HandlerChecker: " + mName + " for reason: "
            + reason + ". Pause count: " + mPauseCount);
}

public void resumeLocked(String reason) {
    if (mPauseCount > 0) {
        mPauseCount--;
        Slog.i(TAG, "Resuming HandlerChecker: " + mName + " for reason: "
                + reason + ". Pause count: " + mPauseCount);
    } else {
        Slog.wtf(TAG, "Already resumed HandlerChecker: " + mName);
    }
}

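As shown, when HandlerChecker.pauseLocked is called:

  1. mCompleted is set to true directly; as we know, this flag indicates that the check message has run successfully.
  2. mPauseCount is incremented. Its effect shows up in scheduleCheckLocked: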
public void scheduleCheckLocked() {
    if (mCompleted) {
        mMonitors.addAll(mMonitorQueue);
        mMonitorQueue.clear();
    }
    if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
            || (mPauseCount > 0)) { // checks mPauseCount > 0
        mCompleted = true;
        return;
    }
    if (!mCompleted) {
        return;
    }
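    // Mark the check as not completed
    mCompleted = false;
    mCurrentMonitor = null;
    // Record when the check was started
    mStartTime = SystemClock.uptimeMillis();
    // Post this checker (a Runnable) to the front of the monitored thread's message
    // queue; when the message is handled, run() is executed
    mHandler.postAtFrontOfQueue(this);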
}

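This is the code that posts the check message. If mPauseCount > 0, it sets mCompleted = true and returns without posting a check message. In other words, a non-zero mPauseCount works as a pass: no check is made and the checker reports as completed.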
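Because mPauseCount is a counter rather than a boolean, pauses can nest: checks resume only after every pause has been matched by a resume. A minimal sketch of that behavior follows. PausableChecker is hypothetical and keeps only the pause-related state.

```java
// Hypothetical reduction of HandlerChecker to its pause-related state.
class PausableChecker {
    private int pauseCount;
    private boolean completed = true;

    synchronized void pause() {
        pauseCount++;
        completed = true; // a paused checker always reports as completed
    }

    synchronized void resume() {
        if (pauseCount > 0) {
            pauseCount--;
        }
    }

    // Returns true if a check would be posted, false if this round is skipped.
    synchronized boolean schedule() {
        if (pauseCount > 0) {
            completed = true;
            return false;
        }
        if (!completed) {
            return false; // the previous check is still outstanding
        }
        completed = false;
        return true;
    }

    synchronized void checkFinished() { completed = true; }

    synchronized boolean isCompleted() { return completed; }
}

class PausableCheckerDemo {
    public static void main(String[] args) {
        PausableChecker checker = new PausableChecker();
        checker.pause();
        checker.pause();  // nested pause
        checker.resume(); // one resume is not enough
        System.out.println("after one resume, posted = " + checker.schedule());
        checker.resume();
        System.out.println("after two resumes, posted = " + checker.schedule());
    }
}
```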

So why does this exist? Here is some code that uses the feature:

t.traceBegin("StartPackageManagerService");
try {
    Watchdog.getInstance().pauseWatchingCurrentThread("packagemanagermain");
    mPackageManagerService = PackageManagerService.main(mSystemContext, installer,
            mFactoryTestMode != FactoryTest.FACTORY_TEST_OFF, mOnlyCore);
} finally {
    Watchdog.getInstance().resumeWatchingCurrentThread("packagemanagermain");
}

t.traceBegin("StartOtaDexOptService");
try {
    Watchdog.getInstance().pauseWatchingCurrentThread("moveab");
    OtaDexoptService.main(mSystemContext, mPackageManagerService);
} catch (Throwable e) {
    reportWtf("starting OtaDexOptService", e);
} finally {
    Watchdog.getInstance().resumeWatchingCurrentThread("moveab");
    t.traceEnd();
}

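Pausing shows up in scenarios such as PackageManagerService start-up at boot (package scanning) and dex optimization. Presumably these steps are inherently slow and are a required part of boot, so by design the watchdog check for the current thread is paused while they run.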

Debugging scenario

@Override
public void run() {
    boolean waitedHalf = false;
    while (true) {
        final List<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        int debuggerWasConnected = 0;
        synchronized (this) {
            long timeout = CHECK_INTERVAL;
            // 1 Schedule the checks
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }

            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }

            long start = SystemClock.uptimeMillis();
            // 2 Wait 30s
            while (timeout > 0) {
                // If a debugger is currently attached
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
            // 3 Evaluate the state
            final int waitState = evaluateCheckerCompletionLocked();
            if (waitState == COMPLETED) {
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) {
                continue;
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    Slog.i(TAG, "WAITED_HALF");
                    ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
                    ActivityManagerService.dumpStackTraces(pids, null, null,
                            getInterestingNativePids(), null);
                    waitedHalf = true;
                }
                continue;
            }

            blockedCheckers = getBlockedCheckersLocked();
            subject = describeCheckersLocked(blockedCheckers);
            allowRestart = mAllowRestart;
        }

        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

        // Logging...
        ...

        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        // Do not restart while a debugger is (or was) connected
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
            Slog.w(TAG, "*** GOODBYE!");
            Process.killProcess(Process.myPid());
            System.exit(10);
        }

        waitedHalf = false;
    }
}

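As the code shows, when a debugger is attached the watchdog sets debuggerWasConnected, mainly so that the final decision does not take the kill branch. The decrement at the top of the loop is evidently meant to keep the watchdog from restarting for a couple more rounds after the debugger detaches. Note, though, that in the code as quoted here debuggerWasConnected is declared inside the while loop, so it starts from 0 on every iteration.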
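The order of the final checks can be isolated into a small pure function. The sketch below is an illustration only: RestartDecision and the Action names are hypothetical, and the function takes the two inputs the quoted code consults at that point, in the same order.

```java
class RestartDecision {
    enum Action {
        SKIP_DEBUGGER_CONNECTED,
        SKIP_DEBUGGER_WAS_CONNECTED,
        SKIP_RESTART_NOT_ALLOWED,
        KILL_SYSTEM_PROCESS
    }

    // Same branch order as the if/else chain at the end of run().
    static Action decide(int debuggerWasConnected, boolean allowRestart) {
        if (debuggerWasConnected >= 2) {
            return Action.SKIP_DEBUGGER_CONNECTED;
        } else if (debuggerWasConnected > 0) {
            return Action.SKIP_DEBUGGER_WAS_CONNECTED;
        } else if (!allowRestart) {
            return Action.SKIP_RESTART_NOT_ALLOWED;
        }
        return Action.KILL_SYSTEM_PROCESS;
    }

    public static void main(String[] args) {
        System.out.println(decide(2, true));
        System.out.println(decide(1, true));
        System.out.println(decide(0, false));
        System.out.println(decide(0, true));
    }
}
```

The debugger checks come first, so an attached debugger suppresses the kill even when restarts are allowed.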

Binder monitoring

private Watchdog() {
    ...
    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());

The Watchdog constructor registers a BinderThreadMonitor:

private static final class BinderThreadMonitor implements Watchdog.Monitor {
    @Override
    public void monitor() {
        Binder.blockUntilThreadAvailable();
    }
}

The monitor calls Binder.blockUntilThreadAvailable, which ends up in IPCThreadState::blockUntilThreadAvailable:

//IPCThreadState
void IPCThreadState::blockUntilThreadAvailable()
{
    pthread_mutex_lock(&mProcess->mThreadCountLock);
    while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
        ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
                static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
                static_cast<unsigned long>(mProcess->mMaxThreads));
        pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
    }
    pthread_mutex_unlock(&mProcess->mThreadCountLock);
}

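blockUntilThreadAvailable checks whether the number of binder threads currently executing in the process has reached the maximum; it blocks while mExecutingThreadsCount >= mMaxThreads. The purpose of this monitor is therefore to detect that the process's binder thread pool is exhausted.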
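The same wait-on-a-condition pattern can be written in Java to see how the monitor behaves when the pool is full. The sketch below is an analogue under stated assumptions, not the binder implementation: ThreadPoolGate and its methods are hypothetical, with a ReentrantLock and Condition playing the roles of mThreadCountLock and mThreadCountDecrement.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical Java analogue of the counters guarded by mThreadCountLock.
class ThreadPoolGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition decremented = lock.newCondition();
    private final int maxThreads;
    private int executing;

    ThreadPoolGate(int maxThreads) { this.maxThreads = maxThreads; }

    void threadStarted() {
        lock.lock();
        try {
            executing++;
        } finally {
            lock.unlock();
        }
    }

    void threadFinished() {
        lock.lock();
        try {
            executing--;
            decremented.signalAll(); // wake anyone waiting for a free thread
        } finally {
            lock.unlock();
        }
    }

    // Blocks while every thread in the pool is busy; returns once one is free.
    void blockUntilThreadAvailable() {
        lock.lock();
        try {
            while (executing >= maxThreads) {
                decremented.awaitUninterruptibly();
            }
        } finally {
            lock.unlock();
        }
    }
}

class ThreadPoolGateDemo {
    public static void main(String[] args) throws InterruptedException {
        ThreadPoolGate gate = new ThreadPoolGate(2);
        gate.threadStarted();
        gate.threadStarted(); // the pool is now full
        Thread probe = new Thread(gate::blockUntilThreadAvailable);
        probe.start();
        probe.join(200);
        System.out.println("pool full, probe blocked = " + probe.isAlive());
        gate.threadFinished(); // one thread frees up
        probe.join();
        System.out.println("thread freed, probe blocked = " + probe.isAlive());
    }
}
```

If the pool stays full, the probe never returns, so the monitor check that called it never completes and the checker's latency keeps growing toward OVERDUE.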

As described earlier, the watchdog is started inside the system_server process, so the binder thread pool being monitored here is that of system_server.

private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
    t.traceBegin("startBootstrapServices");

    // Start the watchdog as early as possible so we can crash the system server
    // if we deadlock during early boot
    t.traceBegin("StartWatchdog");
    final Watchdog watchdog = Watchdog.getInstance();
    watchdog.start();
    t.traceEnd();

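IV. Summary

1. Registration:

Watchdog provides an inner class, HandlerChecker (called Checker below), that schedules checks and collects the state of its target. It supports two kinds of monitoring:

  1. Detecting a blocked message loop on the target thread
    The Handler of the thread to watch is passed in when the Checker is created.
  2. Custom checks: Monitor
    Watchdog creates one dedicated Checker that manages all Monitors; its Handler is FgThread.getHandler(). Callers invoke watchdog.addMonitor(Monitor), and Watchdog hands the Monitor to this dedicated Checker.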

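About the Checker

Scheduling a check:

  1. For message-queue blocking: a message is posted through the registered Handler (the message is the Checker itself, which is a Runnable).
  2. For custom Monitor checks: a message is posted through the FgThread Handler; its run() calls each Monitor's monitor() in turn.

Evaluating the state:

If nothing is blocked, the check message has already run and set the completed flag back to true. If something is blocked, the completed flag is still false, and the current time is compared with the time the check was posted to get the elapsed time, from which the state is derived. If a target has been blocked for 60s or more, the state is OVERDUE.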

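2. Starting Watchdog:

Watchdog is a Thread; system_server starts it at the beginning of startBootstrapServices, before the system services are started.

run() is an infinite loop that repeats three steps:

  1. Iterate over all Checkers and have each schedule its check
  2. Wait 30s
  3. Iterate over all Checkers and evaluate their states to decide whether something is wrong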

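3. Handling a hang

1 Collect logs: record the blocked targets, the reason for the block, and other diagnostic information

2 Decide, based on the settings, whether to kill the system process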