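Purpose
Watchdog checks whether important system services or threads are blocked, so that the system cannot stay hung: when it detects a hang, it kills its own process, which restarts the system processes. In effect it is an "ANR" detector for the system itself. It also receives reboot broadcasts from system components and reboots the system in response.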
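How it works
Broadly, Watchdog runs on a thread with an infinite loop and schedules check tasks inside the loop body. Checks on system services are run by one dedicated thread (FgThread); every other monitored thread is checked by that thread itself. After scheduling a round of checks, Watchdog blocks for a fixed interval and then inspects the result for every monitored object (service or thread). If any service or thread is still blocked once its timeout has elapsed, Watchdog restarts the system process.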
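The idea in the paragraph above can be sketched in plain Java without any Android classes. This is a minimal illustration, not the AOSP implementation: the class MiniWatchdogSketch and the method isResponsive are hypothetical names, and a single-thread ExecutorService stands in for a Handler thread. The checker posts an empty heartbeat task to the worker and treats the worker as blocked if the heartbeat has not run within the timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-alone sketch of the heartbeat idea; not the AOSP Watchdog. */
final class MiniWatchdogSketch {

    /**
     * Posts an empty heartbeat task to the worker and waits up to timeoutMillis.
     * Returns true if the worker ran it in time, false if the worker looks blocked.
     */
    static boolean isResponsive(ExecutorService worker, long timeoutMillis) {
        Future<?> heartbeat = worker.submit(() -> { });
        try {
            heartbeat.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            heartbeat.cancel(false); // drop the stale heartbeat
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false; // the empty task cannot fail, but the API requires handling this
        }
    }

    public static void main(String[] args) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);

        // Simulate a stuck thread: this task occupies the only worker thread.
        worker.submit(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        System.out.println("responsive while blocked: " + isResponsive(worker, 200));
        release.countDown();
        System.out.println("responsive after release: " + isResponsive(worker, 2000));
        worker.shutdownNow();
    }
}
```

The real Watchdog differs in one important way: it does not block on each target in turn but schedules all checks first and then evaluates them together after a single wait.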
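Three classes and interfaces matter here.
The first is the Monitor interface. It has a single method, monitor(), which is implemented by the monitored objects (the system services). Take the InputManagerService implementation as an example: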
// Called by the heartbeat to ensure locks are not held indefinitely (for deadlock detection).
@Override
public void monitor() {
    synchronized (mInputFilterLock) { }
    synchronized (mAssociationsLock) { /* Test if blocked by associations lock. */}
    synchronized (mLidSwitchLock) { /* Test if blocked by lid switch lock. */ }
    synchronized (mInputMonitors) { /* Test if blocked by input monitor lock. */ }
    synchronized (mAdditionalDisplayInputPropertiesLock) { /* Test if blocked by props lock */ }
    mNative.monitor();
}
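All it does is acquire each lock in turn; if every lock can be acquired, the service is running normally. Most Monitor implementations look like the one above. The one exception is BinderThreadMonitor, which checks whether the Binder thread pool is healthy by testing whether a Binder thread is available: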
/** Monitor for checking the availability of binder threads. The monitor will block until
 * there is a binder thread available to process in coming IPCs to make sure other processes
 * can still communicate with the service.
 */
private static final class BinderThreadMonitor implements Watchdog.Monitor {
    @Override
    public void monitor() {
        Binder.blockUntilThreadAvailable();
    }
}
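To see why an empty synchronized block works as a health probe, here is a small plain-Java sketch. Everything in it (LockProbeSketch, DemoService, probe) is a hypothetical name made up for illustration, and the Monitor interface is redeclared locally rather than taken from Watchdog. The probe runs monitor() on a helper thread and reports whether it returned within a timeout, which is roughly the signal a watchdog gets from a lock-probing monitor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a lock-probing monitor; names are not from AOSP. */
final class LockProbeSketch {

    /** Local stand-in for Watchdog.Monitor. */
    interface Monitor {
        void monitor();
    }

    /** A made-up service whose monitor() only acquires and releases its lock. */
    static final class DemoService implements Monitor {
        final Object lock = new Object();

        @Override
        public void monitor() {
            synchronized (lock) {
                // Reaching this point proves that nobody is holding the lock forever.
            }
        }
    }

    /** Runs monitor() on a helper thread; true if it returned within timeoutMillis. */
    static boolean probe(Monitor m, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            m.monitor();
            done.countDown();
        }, "probe");
        t.setDaemon(true);
        t.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        DemoService service = new DemoService();
        System.out.println("lock free: " + probe(service, 1000));
        synchronized (service.lock) {
            // While this thread holds the lock, the probe thread cannot enter monitor().
            System.out.println("lock held: " + probe(service, 200));
        }
    }
}
```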
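The second is the HandlerChecker class. HandlerChecker holds two Monitor lists. A newly added Monitor first goes into the staging list mMonitorQueue and is moved to the live list mMonitors only when the next round of checks starts, which guarantees that mMonitors never changes in the middle of an iteration. HandlerChecker implements Runnable; its run method calls Monitor.monitor() on each entry and then sets the mCompleted flag to mark the check as finished. It does not start a thread of its own; instead it posts itself to the Handler passed to its constructor. The constructor takes two parameters: a Handler and a name. The Handler runs the code that checks whether the services (Monitors) are blocked, and it also tests whether its own message queue is stuck, since being able to process the HandlerChecker message at all proves that the queue is not blocked. That is how thread checks work. The blocking timeout is not a constructor argument; it is passed to scheduleCheckLocked on every round, and the system default is 60 seconds. The two methods to focus on are scheduleCheckLocked and run.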
/**
 * Used for checking status of handle threads and scheduling monitor callbacks.
 */
public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>(); // live list of Monitors to check
    private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>(); // Monitors waiting to be added
    private long mWaitMax;
    private boolean mCompleted;
    private Monitor mCurrentMonitor;
    private long mStartTime;
    private int mPauseCount;

    HandlerChecker(Handler handler, String name) {
        mHandler = handler;
        mName = name;
        mCompleted = true;
    }

    void addMonitorLocked(Monitor monitor) {
        // We don't want to update mMonitors when the Handler is in the middle of checking
        // all monitors. We will update mMonitors on the next schedule if it is safe
        mMonitorQueue.add(monitor);
    }

    public void scheduleCheckLocked(long handlerCheckerTimeoutMillis) {
        mWaitMax = handlerCheckerTimeoutMillis;
        if (mCompleted) {
            // Safe to update monitors in queue, Handler is not in the middle of work
            mMonitors.addAll(mMonitorQueue);
            mMonitorQueue.clear();
        }
        if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
                || (mPauseCount > 0)) {
            // Don't schedule until after resume OR
            // If the target looper has recently been polling, then
            // there is no reason to enqueue our checker on it since that
            // is as good as it not being deadlocked. This avoid having
            // to do a context switch to check the thread. Note that we
            // only do this if we have no monitors since those would need to
            // be executed at this point.
            mCompleted = true;
            return;
        }
        if (!mCompleted) {
            // we already have a check in flight, so no need
            return;
        }
        mCompleted = false;
        mCurrentMonitor = null;
        mStartTime = SystemClock.uptimeMillis();
        mHandler.postAtFrontOfQueue(this);
    }

    public int getCompletionStateLocked() {
        if (mCompleted) {
            return COMPLETED;
        } else {
            long latency = SystemClock.uptimeMillis() - mStartTime;
            if (latency < mWaitMax/2) {
                return WAITING;
            } else if (latency < mWaitMax) {
                return WAITED_HALF;
            }
        }
        return OVERDUE;
    }
    ...
    @Override
    public void run() {
        // Once we get here, we ensure that mMonitors does not change even if we call
        // #addMonitorLocked because we first add the new monitors to mMonitorQueue and
        // move them to mMonitors on the next schedule when mCompleted is true, at which
        // point we have completed execution of this method.
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {
            synchronized (mLock) {
                mCurrentMonitor = mMonitors.get(i);
            }
            mCurrentMonitor.monitor();
        }
        synchronized (mLock) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }

    /** Pause the HandlerChecker. */
    public void pauseLocked(String reason) {
        mPauseCount++;
        // Mark as completed, because there's a chance we called this after the watchog
        // thread loop called Object#wait after 'WAITED_HALF'. In that case we want to ensure
        // the next call to #getCompletionStateLocked for this checker returns 'COMPLETED'
        mCompleted = true;
        Slog.i(TAG, "Pausing HandlerChecker: " + mName + " for reason: "
                + reason + ". Pause count: " + mPauseCount);
    }

    /** Resume the HandlerChecker from the last {@link #pauseLocked}. */
    public void resumeLocked(String reason) {
        if (mPauseCount > 0) {
            mPauseCount--;
            Slog.i(TAG, "Resuming HandlerChecker: " + mName + " for reason: "
                    + reason + ". Pause count: " + mPauseCount);
        } else {
            Slog.wtf(TAG, "Already resumed HandlerChecker: " + mName);
        }
    }
}
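The two-list arrangement in HandlerChecker (mMonitors plus mMonitorQueue) is easy to try out in isolation. The sketch below is a simplified plain-Java model, not the AOSP class: DeferredMonitorListSketch and its methods are hypothetical names, Runnable stands in for Monitor, and the round is run by calling runRound() directly instead of posting to a Handler. It shows that a monitor registered while a round is running only becomes active on the next schedule.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the staging-list pattern used by HandlerChecker. */
final class DeferredMonitorListSketch {
    private final List<Runnable> active = new ArrayList<>();  // plays the role of mMonitors
    private final List<Runnable> pending = new ArrayList<>(); // plays the role of mMonitorQueue
    private boolean completed = true;

    /** New monitors always go to the staging list first. */
    synchronized void add(Runnable monitor) {
        pending.add(monitor);
    }

    /**
     * Starts a round. Staged monitors are merged only when the previous round has
     * finished. Returns false if a round is still in flight.
     */
    synchronized boolean schedule() {
        if (!completed) {
            return false;
        }
        active.addAll(pending);
        pending.clear();
        completed = false;
        return true;
    }

    /** Runs one round; the real code does this on the target Handler thread. */
    void runRound() {
        final int size;
        synchronized (this) {
            size = active.size();
        }
        for (int i = 0; i < size; i++) {
            Runnable current;
            synchronized (this) {
                current = active.get(i);
            }
            current.run(); // runs without the lock, like mCurrentMonitor.monitor()
        }
        synchronized (this) {
            completed = true;
        }
    }

    synchronized int activeCount() {
        return active.size();
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        DeferredMonitorListSketch checker = new DeferredMonitorListSketch();
        // This monitor registers another monitor while the round is running.
        checker.add(() -> checker.add(() -> System.out.println("late monitor ran")));

        checker.schedule();
        checker.runRound();
        System.out.println("after round 1: active=" + checker.activeCount()
                + " pending=" + checker.pendingCount());

        checker.schedule();
        checker.runRound();
    }
}
```

Because the live list is only touched inside schedule() while no round is in flight, the loop in runRound() never observes a list that changes underneath it.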
The third is the Watchdog class itself. Two of its member variables are worth noting:
private final ArrayList<HandlerCheckerAndTimeout> mHandlerCheckers = new ArrayList<>();
private final HandlerChecker mMonitorChecker;
mMonitorChecker is the HandlerChecker for system services. mHandlerCheckers holds every HandlerChecker, including mMonitorChecker. All of them are instantiated in the Watchdog constructor.
The key part is its run method; a trimmed version follows:
private void run() {
    boolean waitedHalf = false;
    while (true) {
        ...
        boolean doWaitedHalfDump = false;
        final long watchdogTimeoutMillis = mWatchdogTimeoutMillis;
        final long checkIntervalMillis = watchdogTimeoutMillis / 2;
        synchronized (mLock) {
            long timeout = checkIntervalMillis;
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerCheckerAndTimeout hc = mHandlerCheckers.get(i);
                hc.checker().scheduleCheckLocked(hc.customTimeoutMillis()
                        .orElse(watchdogTimeoutMillis * Build.HW_TIMEOUT_MULTIPLIER));
            }
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                try {
                    mLock.wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                timeout = checkIntervalMillis - (SystemClock.uptimeMillis() - start);
            }
            final int waitState = evaluateCheckerCompletionLocked();
            if (waitState == COMPLETED) {
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) {
                continue;
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    Slog.i(TAG, "WAITED_HALF");
                    waitedHalf = true;
                    subject = describeCheckersLocked(blockedCheckers);
                    pids = new ArrayList<>(mInterestingJavaPids);
                    doWaitedHalfDump = true;
                } else {
                    continue;
                }
            } else {
                // something is overdue!
                blockedCheckers = getCheckersWithStateLocked(OVERDUE);
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
                pids = new ArrayList<>(mInterestingJavaPids);
            }
        } // END synchronized (mLock)
        logWatchog(doWaitedHalfDump, subject, pids);
        if (doWaitedHalfDump) {
            continue;
        }
        IActivityController controller;
        synchronized (mLock) {
            controller = mController;
        }
        if (controller != null) {
            try {
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }
        ...
        Process.killProcess(Process.myPid());
        System.exit(10);
        ...
    }
}
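The inner while (timeout > 0) loop in run() recomputes the remaining time after every wakeup, so an early return from wait (a notify, an interrupt, or a spurious wakeup) cannot shorten the check interval. A minimal plain-Java sketch of that pattern follows. TimedWaitSketch and waitFullInterval are hypothetical names, and System.nanoTime() is used here as a stand-in for SystemClock.uptimeMillis(), which is only available on Android.

```java
/** Hypothetical sketch of the "wait for the full interval" loop; not AOSP code. */
final class TimedWaitSketch {

    /**
     * Blocks on the given lock until at least intervalMillis have elapsed,
     * going back to wait after any early wakeup. Returns the elapsed milliseconds.
     */
    static long waitFullInterval(Object lock, long intervalMillis) {
        final long start = System.nanoTime();
        long remaining = intervalMillis;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Like the original loop, note the interruption and keep waiting.
                }
                long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
                remaining = intervalMillis - elapsedMillis;
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        final Object lock = new Object();

        // A helper thread that wakes the waiter early several times.
        Thread poker = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                try {
                    Thread.sleep(20);
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (lock) {
                    lock.notifyAll();
                }
            }
        }, "poker");
        poker.setDaemon(true);
        poker.start();

        long elapsed = waitFullInterval(lock, 150);
        System.out.println("waited at least 150 ms: " + (elapsed >= 150));
    }
}
```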
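Each iteration of the loop calls HandlerChecker.scheduleCheckLocked() on every checker, waits for half of the timeout (30 seconds, for example), and then calls evaluateCheckerCompletionLocked to examine the results of all HandlerCheckers. There are four possible states: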
// These are temporally ordered: larger values as lateness increases
private static final int COMPLETED = 0;
private static final int WAITING = 1;
private static final int WAITED_HALF = 2;
private static final int OVERDUE = 3;
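The overall result is the largest value among all HandlerCheckers. COMPLETED or WAITING means everything is normal, and a new round of checks begins. WAITED_HALF means at least one HandlerChecker has been waiting for more than half of its timeout without finishing; Watchdog writes some logs (only the first time, as tracked by the waitedHalf flag) and then moves on to the next round. OVERDUE means at least one HandlerChecker has timed out, that is, the system is hung; execution falls through to the code that kills the current (SystemServer) process, which restarts the system. One point to keep in mind is that each HandlerChecker is at its own independent point in time: when HandlerChecker.scheduleCheckLocked runs, some checkers may be in the COMPLETED state while others are in WAITED_HALF.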
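The state logic can be reduced to two small pure functions, shown here as a plain-Java sketch. CompletionStateSketch, stateOf and worstOf are hypothetical names; stateOf mirrors the branches of getCompletionStateLocked shown earlier, and worstOf assumes, as described above, that the overall result is simply the maximum over all checkers.

```java
/** Hypothetical sketch of how per-checker states combine; not AOSP code. */
final class CompletionStateSketch {
    // Same ordering as in Watchdog: larger values mean "later".
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Maps one checker's progress to a state, following getCompletionStateLocked. */
    static int stateOf(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** The overall state is the worst (largest) state of any checker. */
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int state : states) {
            worst = Math.max(worst, state);
        }
        return worst;
    }

    public static void main(String[] args) {
        long waitMax = 60_000; // the default timeout mentioned in the text

        int fg = stateOf(true, 0, waitMax);        // finished its check
        int ui = stateOf(false, 10_000, waitMax);  // still running, under half
        int io = stateOf(false, 35_000, waitMax);  // still running, past half

        System.out.println("fg=" + fg + " ui=" + ui + " io=" + io
                + " overall=" + worstOf(fg, ui, io));
    }
}
```

With a 60 second timeout and a 30 second check interval, a checker that never completes is seen as WAITED_HALF at the first evaluation after it was scheduled and as OVERDUE at the second.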
Flow
The relevant parts of SystemServer are as follows:
/**
 * The main entry point from zygote.
 */
public static void main(String[] args) {
    new SystemServer().run();
}

private void run(){
    ...
    // Start services.
    try {
        t.traceBegin("StartServices");
        startBootstrapServices(t);
        startCoreServices(t);
        startOtherServices(t);
        startApexServices(t);
    } catch (Throwable ex) {
        throw ex;
    } finally {
        t.traceEnd(); // StartServices
    }
    ...
}

private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
    ...
    // Start the watchdog as early as possible so we can crash the system server
    // if we deadlock during early boot
    t.traceBegin("StartWatchdog");
    final Watchdog watchdog = Watchdog.getInstance();
    watchdog.start();
    mDumper.addDumpable(watchdog);
    t.traceEnd();
    ...
    mActivityManagerService = ActivityManagerService.Lifecycle.startService(
            mSystemServiceManager, atm);
    ...
    watchdog.init(mSystemContext, mActivityManagerService);
}
As shown, Watchdog is a singleton, and it is instantiated and started very early, while SystemServer is still starting its first services. Let's look at getInstance and start:
public static Watchdog getInstance() {
    if (sWatchdog == null) {
        sWatchdog = new Watchdog();
    }
    return sWatchdog;
}

private Watchdog() {
    mThread = new Thread(this::run, "watchdog"); // create the Watchdog worker thread
    // Initialize handler checkers for each common thread we want to check. Note
    // that we are not currently checking the background thread, since it can
    // potentially hold longer running operations with no guarantees about the timeliness
    // of operations there.
    //
    // The shared foreground thread is the main checker. It is where we
    // will also dispatch monitor checks and do other work.
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread"); // create mMonitorChecker
    mHandlerCheckers.add(withDefaultTimeout(mMonitorChecker));
    // Add checker for main thread. We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(withDefaultTimeout(
            new HandlerChecker(new Handler(Looper.getMainLooper()), "main thread")));
    // Add checker for shared UI thread.
    mHandlerCheckers.add(withDefaultTimeout(
            new HandlerChecker(UiThread.getHandler(), "ui thread")));
    // And also check IO thread.
    mHandlerCheckers.add(withDefaultTimeout(
            new HandlerChecker(IoThread.getHandler(), "i/o thread")));
    // And the display thread.
    mHandlerCheckers.add(withDefaultTimeout(
            new HandlerChecker(DisplayThread.getHandler(), "display thread")));
    // And the animation thread.
    mHandlerCheckers.add(withDefaultTimeout(
            new HandlerChecker(AnimationThread.getHandler(), "animation thread")));
    // And the surface animation thread.
    mHandlerCheckers.add(withDefaultTimeout(
            new HandlerChecker(SurfaceAnimationThread.getHandler(),
                    "surface animation thread")));
    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());
    ...
}

/**
 * Called by SystemServer to cause the internal thread to begin execution.
 */
public void start() {
    mThread.start();
}
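getInstance is straightforward, so let's look at the Watchdog constructor. It instantiates mMonitorChecker and adds it to the mHandlerCheckers list, then adds a number of other HandlerCheckers to the same list. From the code above, Watchdog monitors these threads: FgThread, the main thread, UiThread, IoThread, DisplayThread, AnimationThread, and SurfaceAnimationThread. Services that implement the Monitor interface generally register themselves with Watchdog through addMonitor during their own initialization. start is equally simple: it starts the Watchdog worker thread. From that point on, Watchdog is able to monitor whether important system services are blocked. Next, the Watchdog.init method: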
public void init(Context context, ActivityManagerService activity) {
    mActivity = activity;
    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);
}

final class RebootRequestReceiver extends BroadcastReceiver {
    @Override
    public void onReceive(Context c, Intent intent) {
        if (intent.getIntExtra("nowait", 0) != 0) {
            rebootSystem("Received ACTION_REBOOT broadcast");
            return;
        }
    }
}

void rebootSystem(String reason) {
    IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
    try {
        pms.reboot(false, reason, false);
    } catch (RemoteException ex) {
    }
}
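This is simple too: it registers a broadcast receiver for ACTION_REBOOT. When a system component sends that broadcast with a non-zero "nowait" extra, the receiver reboots the system. init is called only after ActivityManagerService has started precisely because it needs to register this receiver.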