一、使用
通过watchdog的启动以及系统服务注册watchdog等入手来看一下它是如何运作的。
启动watchdog
private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
t.traceBegin("startBootstrapServices");
// Start the watchdog as early as possible so we can crash the system server
// if we deadlock during early boot
t.traceBegin("StartWatchdog");
final Watchdog watchdog = Watchdog.getInstance();
watchdog.start();
t.traceEnd();
注册watchdog监测
以AMS的注册为例:
Watchdog.getInstance().addMonitor(this);
Watchdog.getInstance().addThread(mHandler);
二、机制分析
注册
注册watchdog监听有两种监听:addMonitor和addThread
addMonitor:
private final HandlerChecker mMonitorChecker;
public interface Monitor {
void monitor();
}
public void addMonitor(Monitor monitor) {
synchronized (this) {
mMonitorChecker.addMonitorLocked(monitor);
}
}
public final class HandlerChecker implements Runnable {
private final Handler mHandler;
private final String mName;
private final long mWaitMax;
private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
private boolean mCompleted;
private Monitor mCurrentMonitor;
private long mStartTime;
private int mPauseCount;
HandlerChecker(Handler handler, String name, long waitMaxMillis) {
mHandler = handler;
mName = name;
mWaitMax = waitMaxMillis;
mCompleted = true;
}
void addMonitorLocked(Monitor monitor) {
// We don't want to update mMonitors when the Handler is in the middle of checking
// all monitors. We will update mMonitors on the next schedule if it is safe
mMonitorQueue.add(monitor);
}
把monitor注册到一个mMonitorChecker的用来保存monitor的叫做mMonitorQueue的数组中。
addThread:
public void addThread(Handler thread) {
addThread(thread, DEFAULT_TIMEOUT);
}
public void addThread(Handler thread, long timeoutMillis) {
synchronized (this) {
final String name = thread.getLooper().getThread().getName();
mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
}
}
创建一个HandlerChecker持有注册的Handler,然后把这个HandlerChecker放入watchdog的用来保存HandlerChecker的全局容器mHandlerCheckers中。是用来监听Handler的消息队列的。
再来看一下watchdog的构造:
private Watchdog() {
super("watchdog");
// Initialize handler checkers for each common thread we want to check. Note
// that we are not currently checking the background thread, since it can
// potentially hold longer running operations with no guarantees about the timeliness
// of operations there.
// The shared foreground thread is the main checker. It is where we
// will also dispatch monitor checks and do other work.
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
"foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Add checker for main thread. We only do a quick check since there
// can be UI running on the thread.
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
"main thread", DEFAULT_TIMEOUT));
// Add checker for shared UI thread.
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
"ui thread", DEFAULT_TIMEOUT));
// And also check IO thread.
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
"i/o thread", DEFAULT_TIMEOUT));
// And the display thread.
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
"display thread", DEFAULT_TIMEOUT));
// And the animation thread.
mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
"animation thread", DEFAULT_TIMEOUT));
// And the surface animation thread.
mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
"surface animation thread", DEFAULT_TIMEOUT));
// Initialize monitor for Binder threads.
addMonitor(new BinderThreadMonitor());
mInterestingJavaPids.add(Process.myPid());
// See the notes on DEFAULT_TIMEOUT.
assert DB ||
DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}
可以看出,用来承载monitor的mMonitorChecker也是一个HandlerChecker,并且也存在mHandlerCheckers中,和使用addThread注册方式生成的HandlerChecker一样,都保存在mHandlerCheckers中,除此之外,还有许多HandlerChecker,分别监听着UiThread、IoThread、DisplayThread等。
总结一下:
- Watchdog提供了HandlerChecker,是一个检查者,它可以跟踪一个Handler的消息队列或多个自定义Monitor(注意:跟踪Monitor的Checker是一个专门的Checker:mMonitorChecker)。
- Watchdog提供了Monitor接口,实现接口的monitor就可以注册进watchdog被监听了。
- Watchdog的构造中已经注册了多个HandlerChecker分别对UiThread、IoThread、DisplayThread等线程队列监听,其他需要注册监听的通过调用提供的addThread方法注册,都会单独创建一个HandlerChecker对其监听。
- 以上的所有HandlerChecker统一由全局容器mHandlerCheckers保存。
运行
public class Watchdog extends Thread {
可以看出watchdog实际就是一个Thread线程,启动方式watchdog.start()实际就是启动它这个线程。
private static final long DEFAULT_TIMEOUT = DB ? 10 * 1000 : 60 * 1000;
private static final long CHECK_INTERVAL = DEFAULT_TIMEOUT / 2;
@Override
public void run() {
boolean waitedHalf = false;
while (true) {
final List<HandlerChecker> blockedCheckers;
final String subject;
final boolean allowRestart;
synchronized (this) {
long timeout = CHECK_INTERVAL; // 30s
// 1 遍历 mHandlerCheckers 里所有的 Checker,调用其 scheduleCheckLocked
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
hc.scheduleCheckLocked();
}
long start = SystemClock.uptimeMillis();
// 2 wait够CHECK_INTERVAL(30s)的时长
while (timeout > 0) {
try {
wait(timeout);
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
// 3
// 遍历所有的 Checker,取其中最“糟糕”的状态
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
// COMPLETED表示所有的Checker监听的对象都没有block,
// 还原所有状态(waitedHalf 用来标记是否已经堵过半分钟了)
// 并continue本次循环
waitedHalf = false;
continue;
} else if (waitState == WAITING) {
// 发现有正在block的对象
continue;
} else if (waitState == WAITED_HALF) {
// 发现有已经block过半分钟的对象
if (!waitedHalf) {
Slog.i(TAG, "WAITED_HALF");
ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
ActivityManagerService.dumpStackTraces(pids, null, null,
getInterestingNativePids(), null);
waitedHalf = true;
}
continue;
}
// 上面三个if分支都没走进去,说明已经有高于阻塞半分钟(即有对象连续两个半分钟都在block)的对象了
// 说明已经发生异常,需要处理了。
// 取出罪魁祸首,发生严重block的对象
blockedCheckers = getBlockedCheckersLocked();
// 取出问题对象们的名称、线程名等信息
subject = describeCheckersLocked(blockedCheckers);
// 标记需要重启
allowRestart = mAllowRestart;
}
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
// ************* 开始收集信息,写日志 *************
long anrTime = SystemClock.uptimeMillis();
StringBuilder report = new StringBuilder();
report.append(MemoryPressureUtil.currentPsiState());
ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
StringWriter tracesFileException = new StringWriter();
final File stack = ActivityManagerService.dumpStackTraces(
pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
tracesFileException);
SystemClock.sleep(5000);
processCpuTracker.update();
report.append(processCpuTracker.printCurrentState(anrTime));
report.append(tracesFileException.getBuffer());
doSysRq('w');
doSysRq('l');
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
if (mActivity != null) {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null, null,
subject, report.toString(), stack, null);
}
FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED,
subject);
}
};
dropboxThread.start();
try {
dropboxThread.join(2000);
} catch (InterruptedException ignored) {}
// ****************************************************
IActivityController controller;
synchronized (this) {
controller = mController;
}
if (controller != null) {
Slog.i(TAG, "Reporting stuck state to activity controller");
try {
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
int res = controller.systemNotResponding(subject);
if (res >= 0) {
Slog.i(TAG, "Activity controller requested to coninue to wait");
waitedHalf = false;
continue;
}
} catch (RemoteException e) {
}
}
// 退出系统
if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
}
(为了方便理解,删除了debug场景的一些代码,后面再单独讲)
首先提一下watchdog的策略block上限是60s。
从代码可以看出,run方法里就是一个无限死循环,在循环体内主要分三步走:
- 发起一次所有Checker的检查,
- 等待30s,
- 检查所有Checker监听的对象的状态,根据状态判断是否需要dump以及重启,如果不需要就跳入下一次循环。
下面来看具体如何实现上面的1 2 3的
1 发起Checker的检查
public void scheduleCheckLocked() {
if (mCompleted) {
mMonitors.addAll(mMonitorQueue);
mMonitorQueue.clear();
}
if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
|| (mPauseCount > 0)) {
mCompleted = true;
return;
}
if (!mCompleted) {
return;
}
// 设置状态未完成
mCompleted = false;
mCurrentMonitor = null;
// 记录发起时间
mStartTime = SystemClock.uptimeMillis();
// 向被监听的对象的消息队列发送一个检查消息this,消息将会执行run方法
mHandler.postAtFrontOfQueue(this);
}
@Override
public void run() {
final int size = mMonitors.size();
// monitor的检测,遍历所有注册进来的monitor,检测方式就是调用它的monitor方法
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
mCurrentMonitor.monitor();
}
synchronized (Watchdog.this) {
// 执行到这里,说明消息队列没堵塞,并且monitor也都没堵塞,把状态改为true
mCompleted = true;
mCurrentMonitor = null;
}
}
发起检查实际就是用注册进来的Handler发送一个监测消息,如果消息队列没阻塞,消息就能正常执行(就是自己的run方法),在run方法里还有monitor的检测,如果检测都通过了,就把状态mCompleted改回true。
2 等待30s
3 收集1发起的检查结果
evaluateCheckerCompletionLocked
private int evaluateCheckerCompletionLocked() {
int state = COMPLETED;
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
state = Math.max(state, hc.getCompletionStateLocked());
}
return state;
}
// HandlerChecker
public int getCompletionStateLocked() {
if (mCompleted) {
// mCompleted = true,看步骤1讲解,1发起的消息run已经执行完了
// 返回 COMPLETED
return COMPLETED;
} else {
// latency:从1发起消息到现在的耗时时长
long latency = SystemClock.uptimeMillis() - mStartTime;
if (latency < mWaitMax/2) {
// 小于半分钟,给 WAITING
return WAITING;
} else if (latency < mWaitMax) {
// 半分钟 < 1分钟,给WAITED_HALF
return WAITED_HALF;
}
// 这两个都不符合,说明已经超过1分钟,给 OVERDUE
}
return OVERDUE;
}
这个mWaitMax时长是创建Checker时构造传入的,当前代码中看给的都是DEFAULT_TIMEOUT 1分钟,为了方便讲解本文都以1分钟计。
Checker一共提供了4种状态:
COMPLETED:如果步骤1发起的检查消息执行完成会把mCompleted设置为true,表示Checker检查通过,当前没有谁block
WAITING:步骤1发起的检查消息未完成,当前收集状态时间 - 检查消息的发起时间 < 30s
WAITED_HALF:步骤1发起的检查消息未完成,30 < 当前收集状态时间 - 检查消息的发起时间 < 60s
OVERDUE:步骤1发起的检查消息未完成,当前收集状态时间 - 检查消息的发起时间 > 60s
三、其他设置
暂停监测
public void pauseWatchingCurrentThread(String reason) {
synchronized (this) {
for (HandlerChecker hc : mHandlerCheckers) {
if (Thread.currentThread().equals(hc.getThread())) {
hc.pauseLocked(reason);
}
}
}
}
public void resumeWatchingCurrentThread(String reason) {
synchronized (this) {
for (HandlerChecker hc : mHandlerCheckers) {
if (Thread.currentThread().equals(hc.getThread())) {
hc.resumeLocked(reason);
}
}
}
}
// HandlerChecker
public void pauseLocked(String reason) {
mPauseCount++;
mCompleted = true;
Slog.i(TAG, "Pausing HandlerChecker: " + mName + " for reason: "
+ reason + ". Pause count: " + mPauseCount);
}
public void resumeLocked(String reason) {
if (mPauseCount > 0) {
mPauseCount--;
Slog.i(TAG, "Resuming HandlerChecker: " + mName + " for reason: "
+ reason + ". Pause count: " + mPauseCount);
} else {
Slog.wtf(TAG, "Already resumed HandlerChecker: " + mName);
}
}
可以看出,当调用HandlerChecker的pauseLocked时:
- mCompleted直接设置为true,我们知道这个是检查消息是否执行通过的标志。
- mPauseCount++ 我们来看一段代码
public void scheduleCheckLocked() {
if (mCompleted) {
mMonitors.addAll(mMonitorQueue);
mMonitorQueue.clear();
}
if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
|| (mPauseCount > 0)) { // 判断 mPauseCount > 0
mCompleted = true;
return;
}
if (!mCompleted) {
return;
}
// 设置状态未完成
mCompleted = false;
mCurrentMonitor = null;
// 记录发起时间
mStartTime = SystemClock.uptimeMillis();
// 向被监听的对象的消息队列发送一个检查消息this,消息将会执行run方法
mHandler.postAtFrontOfQueue(this);
}
这是发送监测消息的代码,可以看到,如果判断mPauseCount > 0成立,直接mCompleted = true并返回,不再发送检查消息。也就是说 mPauseCount是一个通行证,有了它不再做检查,直接通过。
那么为什么有这个东西呢?我找了一些使用这个功能的代码
t.traceBegin("StartPackageManagerService");
try {
Watchdog.getInstance().pauseWatchingCurrentThread("packagemanagermain");
mPackageManagerService = PackageManagerService.main(mSystemContext, installer,
mFactoryTestMode != FactoryTest.FACTORY_TEST_OFF, mOnlyCore);
} finally {
Watchdog.getInstance().resumeWatchingCurrentThread("packagemanagermain");
}
t.traceBegin("StartOtaDexOptService");
try {
Watchdog.getInstance().pauseWatchingCurrentThread("moveab");
OtaDexoptService.main(mSystemContext, mPackageManagerService);
} catch (Throwable e) {
reportWtf("starting OtaDexOptService", e);
} finally {
Watchdog.getInstance().resumeWatchingCurrentThread("moveab");
t.traceEnd();
}
出现的场景有比如开机PackageManagerService启动做包扫描、dex优化等场景会调用暂停watchdog检查。推测是这些场景本身就是十分耗时的,并且是在开机的必须过程,从设计上看需要这样做,就暂时把watchdog检查关闭了。
Debug调试场景
@Override
public void run() {
boolean waitedHalf = false;
while (true) {
final List<HandlerChecker> blockedCheckers;
final String subject;
final boolean allowRestart;
int debuggerWasConnected = 0;
synchronized (this) {
long timeout = CHECK_INTERVAL;
// 1 发起检查
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
hc.scheduleCheckLocked();
}
if (debuggerWasConnected > 0) {
debuggerWasConnected--;
}
long start = SystemClock.uptimeMillis();
// 2 等待30s
while (timeout > 0) {
// 如果当前是debug调试
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
try {
wait(timeout);
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
// 3 获取状态
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
waitedHalf = false;
continue;
} else if (waitState == WAITING) {
continue;
} else if (waitState == WAITED_HALF) {
if (!waitedHalf) {
Slog.i(TAG, "WAITED_HALF");
ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
ActivityManagerService.dumpStackTraces(pids, null, null,
getInterestingNativePids(), null);
waitedHalf = true;
}
continue;
}
blockedCheckers = getBlockedCheckersLocked();
subject = describeCheckersLocked(blockedCheckers);
allowRestart = mAllowRestart;
}
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
// 日志...
...
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
// 不重启
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
}
从代码可以看出,当debug调试时,会打上debuggerWasConnected标识,主要是为了最后重启判断时不进入重启分支。并且在退出debug调试后,也要循环 debuggerWasConnected -- 两次才恢复允许重启。
Binder监测
private Watchdog() {
...
// Initialize monitor for Binder threads.
addMonitor(new BinderThreadMonitor());
Watchdog构造的时候注册了一个BinderThreadMonitor
private static final class BinderThreadMonitor implements Watchdog.Monitor {
@Override
public void monitor() {
Binder.blockUntilThreadAvailable();
}
}
可以看到monitor监测调用的Binder.blockUntilThreadAvailable,最终调用到 IPCThreadState::blockUntilThreadAvailable
//IPCThreadState
void IPCThreadState::blockUntilThreadAvailable()
{
pthread_mutex_lock(&mProcess->mThreadCountLock);
while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
static_cast<unsigned long>(mProcess->mMaxThreads));
pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
}
pthread_mutex_unlock(&mProcess->mThreadCountLock);
}
blockUntilThreadAvailable是判断进程当前正在运行的binder线程是否达到最大值,如果超出mMaxThreads就阻塞。可以看出这个monitor的意图就是检查进程的binder线程是否满了。
前面有介绍过,watchdog是在systemserver进程启动的,这里监测的进程的binder线程是否满了,实际就是监测的systemserver进程。
private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
t.traceBegin("startBootstrapServices");
// Start the watchdog as early as possible so we can crash the system server
// if we deadlock during early boot
t.traceBegin("StartWatchdog");
final Watchdog watchdog = Watchdog.getInstance();
watchdog.start();
t.traceEnd();
四、总结
一、注册:
Watchdog 提供了一个内部类用来执行发起监测、收集目标状态,这里叫它Checker。它提供了两种监测模式:
- 监测目标的线程循环消息队列阻塞
需要在创建Checker时传入要监测的目标的Handler。 - 自定义监测 —— Monitor
Watchdog创建了一个专门的Checker用来管理所有的Monitor,这个Checker的Handler是FgThread.getHandler()。使用者通过调用watchdog.addMonitor(Monitor),Watchdog就会自动把Monitor放入这个专门的Checker去管理。
Checker介绍
发起监测:
- 对于消息队列阻塞的检查:是通过注册进来的Handler发起一个消息(消息是Checker本身,它是一个Runnable)
- 对于Monitor自定义检测:使用FgThread的Handler发送一个消息,消息的run方法中会逐个执行所有的Monitor的检测方法。
检查状态:
假如没有阻塞,发起的检测消息就已经执行完毕并重载了完成状态;假如发生阻塞,完成状态为false,就要比较当前时间和发起监测消息的时间得到超时时间,再根据情况返回对应的状态。如果有目标阻塞超过60s,说明已经发生OVERDUE。
二、启动Watchdog:
Watchdog 是一个Thread线程,由systemserver开始启动系统服务前start启动。
run方法是一个无限循环,循环执行三步:
- 遍历所有Checker触发他们发起检测
- 等待30s
- 遍历所有Checker检查状态,通过反馈的监控目标的状态判断是否异常
三、发生异常的处理
1 收集日志:包括收集阻塞的目标、阻塞原因、以及其他信息等写日志
2 根据设置判断是否退出系统