Article outline

  • Introduction
  • I. Overview of the Watchdog mechanism in the SystemServer process
  • II. com.android.server.Watchdog
  • 1. The Watchdog#Watchdog() constructor
  • 2. com.android.server.Watchdog.HandlerChecker
  • 2.1. The com.android.server.Watchdog.HandlerChecker class checks the state of the monitored thread itself and of the monitored Binder services
  • 2.2. The HandlerChecker(Handler handler, String name, long waitMaxMillis) constructor
  • 2.3. HandlerChecker#scheduleCheckLocked() checks the state of the monitored thread itself and of the monitored services
  • 2.3.1. mHandler.postAtFrontOfQueue(this) posts the message and HandlerChecker#run handles it
  • 3. Methods Watchdog provides for registering monitored objects: addThread and addMonitor
  • 3.1. `addThread`(Handler thread, long timeoutMillis) registers an `ordinary thread` to monitor
  • 3.2. `addMonitor`(Monitor) registers a `Binder service thread` to monitor
  • 4. The com.android.server.Watchdog#run() method
  • 4.1. Watchdog.HandlerChecker#scheduleCheckLocked posts a Handler message to every monitored thread
  • 5. The com.android.server.Watchdog#evaluateCheckerCompletionLocked method
  • III. Creating the Watchdog in SystemServer#startOtherServices


Introduction


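The watchdog was first introduced as a hardware monitoring mechanism to keep an operating system working around the clock. A hardware watchdog is added to the device, and the operating system must periodically write a specified value to it to show that it has not failed (commonly called "feeding the dog"). If the watchdog is not fed within a set period, it automatically reboots the device.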

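I. Overview of the Watchdog mechanism in the SystemServer process

In Android, the watchdogd daemon started by the init process is responsible for feeding the hardware watchdog. A hardware watchdog is limited, though: it can only monitor the system as a whole and is poorly suited to monitoring a specific thread, because making each thread feed the watchdog at regular intervals would waste CPU and complicate program design. To monitor the service threads in the SystemServer process, Android therefore has a separate software watchdog: the Watchdog created in SystemServer monitors SystemServer's individual threads. Once Watchdog finds that a system service has stopped responding, it kills the SystemServer process. When the Zygote process receives the signal that SystemServer has died, it kills itself. When Zygote's death is in turn reported to the init process, init kills all of Zygote's child processes and restarts Zygote. The effect is a soft reboot of the phone. This works because a SystemServer failure usually has little to do with the kernel, so a restart is often enough to recover.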

\frameworks\base\services\core\java\com\android\server\Watchdog.java

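II. com.android.server.Watchdog

com.android.server.Watchdog extends Thread. It contains two important inner classes, com.android.server.Watchdog.HandlerChecker and com.android.server.Watchdog.BinderThreadMonitor, and one interface, com.android.server.Watchdog.Monitor. Watchdog runs on its own thread: in an infinite loop it walks the HandlerChecker list and posts a Handler message for each checker, sleeps for a while, and then checks whether any service or thread has stopped responding. In short, Watchdog lets us monitor whether threads and Binder services are working normally.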

1. The Watchdog#Watchdog() constructor

public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }
        return sWatchdog;
    }
    private Watchdog() {
        // The shared foreground thread is the main checker.  It is where we will also
        // dispatch monitor checks and do other work. Among the HandlerCheckers,
        // mMonitorChecker is the special one: it runs the registered monitors to
        // detect whether a monitored service is deadlocked.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),"foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),"main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),"ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),"i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());
    }

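The Watchdog constructor mainly does two things:

  • It creates HandlerChecker objects for several shared threads and stores them in the HandlerChecker list
  • It initializes the Monitor object used to monitor the Binder service threads

Callers obtain the Watchdog instance through the singleton accessor. In addition, some of the most important services in SystemServer have dedicated threads for handling messages, and these dedicated threads are also placed under Watchdog's monitoring:

The AThread thread of ActivityManagerService, the thread behind the mHandlerThread field of PackageManagerService, the thread behind the wmHandlerThread field of WindowManagerService, and the thread of PowerManagerService.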

2. com.android.server.Watchdog.HandlerChecker

/** This class calls its monitor every minute. Killing this process if they don't return **/
public class Watchdog extends Thread {
       /* This handler will be used to post message back onto the main thread */
    final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
	...
	/**
     * Used for checking status of handle threads and scheduling monitor callbacks.
     */
    public final class HandlerChecker implements Runnable {
        private final Handler mHandler;
        private final String mName;
        private final long mWaitMax;
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
        private boolean mCompleted;
        private Monitor mCurrentMonitor;
        private long mStartTime;

        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }
			
        public void addMonitor(Monitor monitor) {
            mMonitors.add(monitor);
        }

        public void scheduleCheckLocked() {
            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
                // If the target looper has recently been polling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked.  This avoid having
                // to do a context switch to check the thread.  Note that we
                // only do this if mCheckReboot is false and we have no
                // monitors, since those would need to be executed at this point.
                mCompleted = true;
                return;
            }

            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }

            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            mHandler.postAtFrontOfQueue(this);
        }

        public boolean isOverdueLocked() {
            return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
        }

        public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }
		...
        @Override
        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
    }

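2.1. The com.android.server.Watchdog.HandlerChecker class checks the state of the monitored thread itself and of the monitored Binder services

HandlerChecker implements the Runnable interface, and each HandlerChecker object corresponds to one monitored thread. It uses a Handler to post a message to the monitored thread to determine whether that thread is running normally: if the message is not handled within the allowed time, the thread is blocked or otherwise tied up, that is, it is not running normally.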
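The probing idea can be tried outside Android. The following is a minimal sketch in plain Java, not the Watchdog implementation: a single-threaded ExecutorService stands in for the Handler thread, and the class and method names (ProbeSketch, respondsWithin, demo) are made up for illustration. Unlike postAtFrontOfQueue, an executor appends the probe to the end of its queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical plain-Java analogue of probing a thread with a posted task. */
final class ProbeSketch {

    /** Returns true if the target thread ran our no-op probe within timeoutMillis. */
    static boolean respondsWithin(ExecutorService target, long timeoutMillis) {
        CountDownLatch probeRan = new CountDownLatch(1);
        target.execute(probeRan::countDown);
        try {
            return probeRan.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Probes a free worker, then the same worker while it is stuck in a task. */
    static boolean[] demo() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        try {
            boolean healthy = respondsWithin(worker, 1000);
            // Occupy the worker so the next probe stays queued behind this task.
            worker.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            boolean stuck = respondsWithin(worker, 200);
            return new boolean[] {healthy, stuck};
        } finally {
            release.countDown();   // let the blocking task finish
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("free worker responded:    " + result[0]);
        System.out.println("blocked worker responded: " + result[1]);
    }
}
```

The probe carries no work of its own; whether it runs in time is the only signal, which is the same idea HandlerChecker relies on.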

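2.2. The HandlerChecker(Handler handler, String name, long waitMaxMillis) constructor

Creates a HandlerChecker object from a Handler, the name of the monitored thread, and the maximum time to wait for a response, in milliseconds.

2.3. HandlerChecker#scheduleCheckLocked() checks the state of the monitored thread itself and of the monitored services

Watchdog#run loops forever over Watchdog's HandlerChecker list and calls scheduleCheckLocked() on each checker, which posts a message to the thread that checker monitors.

2.3.1. mHandler.postAtFrontOfQueue(this) posts the message and HandlerChecker#run handles it

After mHandler.postAtFrontOfQueue(this) posts the message, the method that handles it is run. If run executes, the monitored thread itself is fine.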
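Why the probe goes to the front of the queue can be illustrated with an ordinary deque. This is only a sketch under the assumption that the intent is to measure whether the thread is stuck rather than how long its backlog is. The class FrontOfQueueSketch and the message names are hypothetical, and a real MessageQueue also orders messages by time.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of a message queue that already holds pending work. */
final class FrontOfQueueSketch {

    /** Returns the order in which messages run when the probe is added at the front or the back. */
    static List<String> runOrder(boolean probeAtFront) {
        Deque<String> queue = new ArrayDeque<>();
        queue.addLast("pending-1");
        queue.addLast("pending-2");
        if (probeAtFront) {
            queue.addFirst("probe");   // analogous to postAtFrontOfQueue
        } else {
            queue.addLast("probe");    // analogous to an ordinary post
        }
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.pollFirst());
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(runOrder(true));   // [probe, pending-1, pending-2]
        System.out.println(runOrder(false));  // [pending-1, pending-2, probe]
    }
}
```

With the probe at the front, it runs as soon as the message currently being handled finishes, so a long but healthy backlog does not by itself delay the check.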

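3. Methods Watchdog provides for registering monitored objects: addThread and addMonitor

3.1. addThread(Handler thread, long timeoutMillis) registers an ordinary thread to monitor

addThread registers an ordinary thread to be monitored. In practice it creates a HandlerChecker object and adds it to the HandlerChecker list.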

public class Watchdog extends Thread {
    ...
/* This handler will be used to post message back onto the main thread */
final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
...
    public void addThread(Handler thread, long timeoutMillis) {
        synchronized (this) {
            if (isAlive()) { // is the Watchdog thread already running?
                throw new RuntimeException("Threads can't be added once the Watchdog is running");
            }
            final String name = thread.getLooper().getThread().getName();
            mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
        }
    }
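The isAlive() guard means registration is only allowed before the Watchdog thread has been started. The behaviour can be reproduced with a small Thread subclass. This is a sketch with hypothetical names (RegistrationSketch, checkerCount, lateRegistrationRejected) that keeps only the guard and a list of names, not real HandlerChecker objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

/** Hypothetical stand-in for a watchdog thread that accepts registrations only before start(). */
final class RegistrationSketch extends Thread {
    private final List<String> checkers = new ArrayList<>();
    private final CountDownLatch stop = new CountDownLatch(1);

    /** Same guard as in the excerpt above: reject registration once the thread is running. */
    void addThread(String name) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Threads can't be added once the Watchdog is running");
            }
            checkers.add(name);
        }
    }

    int checkerCount() {
        synchronized (this) {
            return checkers.size();
        }
    }

    @Override
    public void run() {
        try {
            stop.await();   // stay alive until shutdown() is called
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    void shutdown() {
        stop.countDown();
    }

    /** Returns true if a registration made after start() was rejected and left the list unchanged. */
    static boolean lateRegistrationRejected() {
        RegistrationSketch watchdog = new RegistrationSketch();
        watchdog.addThread("early");
        watchdog.start();
        try {
            watchdog.addThread("late");
            return false;
        } catch (RuntimeException expected) {
            return watchdog.checkerCount() == 1;
        } finally {
            watchdog.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("late registration rejected: " + lateRegistrationRejected());
    }
}
```

In practice this means every service has to register its thread or monitor during startup, before the watchdog thread is started.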

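3.2. addMonitor(Monitor) registers a Binder service thread to monitor

A Binder call runs on some thread of the underlying Binder thread pool, and which thread executes it is not fixed, so the approach used for ordinary threads cannot tell whether a Binder service is running normally. A second method, addMonitor, is therefore used to register Binder services for checking. In practice it calls HandlerChecker#addMonitor, which adds the monitor to that HandlerChecker's Monitor list.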

public class Watchdog extends Thread {
    ...
/* This handler will be used to post message back onto the main thread */
final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
final HandlerChecker mMonitorChecker;  
...
public void addMonitor(Monitor monitor) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Monitors can't be added once the Watchdog is running");
            }
            mMonitorChecker.addMonitor(monitor);
        }
    }

So a Binder service needs only two steps to be monitored by Watchdog. First, it implements the com.android.server.Watchdog.Monitor interface:

public class Watchdog extends Thread {
    ...
    public interface Monitor {
        void monitor();
    }

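Then it calls Watchdog#addMonitor to register itself in Watchdog's monitor list. System services such as ActivityManagerService, InputManagerService, WindowManagerService, PowerManagerService, and MountService do this. A Binder service usually protects its global state with synchronized, so whether the service is healthy can be judged by whether its lock is being held for too long.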
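The lock-based check can be demonstrated in plain Java. The sketch below is an illustration, not AOSP code: Monitor here is a local stand-in for Watchdog.Monitor, and FakeService, monitorReturnsWithin, and demo are hypothetical names. The monitor() call is run on a helper thread so the caller can observe whether it returns in time.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LockMonitorSketch {
    /** Local stand-in for Watchdog.Monitor. */
    interface Monitor {
        void monitor();
    }

    /** Hypothetical service that guards its state with its own intrinsic lock. */
    static final class FakeService implements Monitor {
        @Override
        public void monitor() {
            synchronized (this) { }   // returns only once the lock can be acquired
        }
    }

    /** Runs monitor() on a helper thread and reports whether it returned within timeoutMillis. */
    static boolean monitorReturnsWithin(Monitor m, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread checker = new Thread(() -> {
            m.monitor();
            done.countDown();
        });
        checker.setDaemon(true);
        checker.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns {result while the lock is free, result while another thread holds the lock}. */
    static boolean[] demo() {
        FakeService service = new FakeService();
        boolean free = monitorReturnsWithin(service, 1000);

        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (service) {        // simulate a service stuck while holding its lock
                held.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        holder.setDaemon(true);
        holder.start();
        try {
            held.await();
            boolean contended = monitorReturnsWithin(service, 200);
            return new boolean[] {free, contended};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new boolean[] {free, false};
        } finally {
            release.countDown();
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("lock free: monitor returned = " + result[0]);
        System.out.println("lock held: monitor returned = " + result[1]);
    }
}
```

The check is cheap when the service is healthy, since acquiring an uncontended lock returns at once, and it blocks for exactly as long as the lock stays held.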

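4. The com.android.server.Watchdog#run() method

Once Watchdog is started, its run method executes. It loops forever over the HandlerChecker list saved earlier and calls Watchdog.HandlerChecker#scheduleCheckLocked on each HandlerChecker to post a Handler message to every monitored thread. After the messages have been posted, it calls wait to put Watchdog to sleep for a while, and finally checks each checker to see whether any ordinary thread or Binder service is in trouble. If one is overdue, it kills the SystemServer process immediately.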

/** This class calls its monitor every minute. Killing this process if they don't return **/
public class Watchdog extends Thread {
       /* This handler will be used to post message back onto the main thread */
    final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
	...
	 @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // Post a Handler message to every monitored thread
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }
                // Sleep for the check interval
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    try {
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }
                // Check whether any monitored thread is in a bad state
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            getInterestingNativePids());
                        waitedHalf = true;
                    }
                    continue;
                }
            }
            ...
            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                Process.killProcess(Process.myPid());
                System.exit(10);
            }
            waitedHalf = false;
        }
    }
    }
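The inner while loop around wait(timeout) in run() exists because Object.wait may return before the timeout has elapsed, for example after a notify or a spurious wakeup, so the remaining time is recomputed and the wait is repeated. Below is a minimal plain-Java sketch of that pattern. It uses System.nanoTime in place of SystemClock.uptimeMillis, and the names WaitLoopSketch, sleepFullInterval, and demoWithEarlyNotify are hypothetical.

```java
final class WaitLoopSketch {

    /** Waits on lock for at least intervalMillis, re-waiting after early wakeups; returns elapsed ms. */
    static long sleepFullInterval(Object lock, long intervalMillis) {
        long start = System.nanoTime();
        long remaining = intervalMillis;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;   // give up early only when interrupted
                }
                long elapsed = (System.nanoTime() - start) / 1_000_000;
                remaining = intervalMillis - elapsed;
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Sleeps for intervalMillis while another thread keeps waking the waiter early. */
    static long demoWithEarlyNotify(long intervalMillis) {
        Object lock = new Object();
        Thread notifier = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                try {
                    Thread.sleep(20);
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (lock) {
                    lock.notifyAll();   // wakes the waiter before its timeout
                }
            }
        });
        notifier.setDaemon(true);
        notifier.start();
        return sleepFullInterval(lock, intervalMillis);
    }

    public static void main(String[] args) {
        System.out.println("elapsed ms: " + demoWithEarlyNotify(200));
    }
}
```

Without the loop, each early wakeup would shorten the check interval and the checkers would be evaluated sooner than intended.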

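4.1. Watchdog.HandlerChecker#scheduleCheckLocked posts a Handler message to every monitored thread

The method first checks whether the HandlerChecker monitors any Binder service. If the Monitor list is empty (no Binder service is monitored) and the monitored thread's message queue is currently idle, the thread is running fine, so mCompleted is set to true and the method returns. Otherwise, unless a previous check is still in flight, it sets mCompleted to false, records the time at which the message is posted, and calls Handler#postAtFrontOfQueue to post a message to the monitored thread.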

/**
     * Used for checking status of handle threads and scheduling monitor callbacks.
     */
    public final class HandlerChecker implements Runnable {
        private final Handler mHandler;
        private final long mWaitMax;
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
        private boolean mCompleted;
        private Monitor mCurrentMonitor;
        private long mStartTime;
        
        public void scheduleCheckLocked() {
            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
                mCompleted = true;
                return;
            }
            if (!mCompleted) {
                return;
            }
            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            mHandler.postAtFrontOfQueue(this);
        }

This then triggers HandlerChecker#run() to handle the message. As long as HandlerChecker#run executes, the monitored thread itself is fine.

public final class HandlerChecker implements Runnable {
    ...
   @Override
 	public void run() {
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {
            synchronized (Watchdog.this) {
                mCurrentMonitor = mMonitors.get(i);
            }
            mCurrentMonitor.monitor();
        }

        synchronized (Watchdog.this) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }
}

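The state of the monitored Binder services still has to be checked, however, so run calls the monitor method that each Binder service provides through its Watchdog.Monitor implementation: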

public class ActivityManagerService extends IActivityManager.Stub
    implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }

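You read that correctly: the monitor method of a Binder service basically looks just like this, with not a single line removed.

The check works by acquiring the Binder service's lock. If the lock can be acquired, mCompleted becomes true, which means both the thread and the services monitored by this HandlerChecker are healthy. If it cannot be acquired, the thread blocks and mCompleted never becomes true, which means the thread or a service may be in trouble. Whether it really is depends on Watchdog.HandlerChecker#getCompletionStateLocked, which checks whether the allowed time has been exceeded.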

public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }
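The thresholds in getCompletionStateLocked are easier to see with concrete numbers. The sketch below restates the same comparisons as a pure function so they can be tried without Android. The class name CompletionStateSketch, the method classify, and the 60-second timeout used in main are assumptions for illustration; the state values 0 to 3 follow the table in section 5.

```java
final class CompletionStateSketch {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Same comparisons as getCompletionStateLocked, with the clock reading passed in as latency. */
    static int classify(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    public static void main(String[] args) {
        long waitMax = 60_000;   // assumed timeout, for illustration only
        System.out.println(classify(true, 90_000, waitMax));    // 0: the probe already ran
        System.out.println(classify(false, 10_000, waitMax));   // 1: under half the timeout
        System.out.println(classify(false, 30_000, waitMax));   // 2: exactly half counts as WAITED_HALF
        System.out.println(classify(false, 60_000, waitMax));   // 3: the full timeout counts as OVERDUE
    }
}
```

Both boundaries are exclusive on the lower state: a latency of exactly half the timeout is already WAITED_HALF, and a latency of exactly the timeout is already OVERDUE.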

5. The com.android.server.Watchdog#evaluateCheckerCompletionLocked method

private int evaluateCheckerCompletionLocked() {
        int state = COMPLETED;
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            state = Math.max(state, hc.getCompletionStateLocked());
        }
        return state;
    }

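This method walks the HandlerChecker list saved earlier, calls Watchdog.HandlerChecker#getCompletionStateLocked to get each checker's state, and returns the largest value:

| State | Value | Description |
| --- | --- | --- |
| COMPLETED | 0 | The checker is healthy |
| WAITING | 1 | Still waiting for the message to be handled |
| WAITED_HALF | 2 | Still waiting for the message to be handled, and more than half of the allowed time has passed |
| OVERDUE | 3 | Still waiting for the message to be handled, and the allowed time has been exceeded; as soon as the state is OVERDUE, the SystemServer process is killed |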

In short, the state of a monitored object is judged by elapsed time.
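Because the states are ordered from healthy (0) to overdue (3), taking the maximum over all checkers yields the state of the worst one. A small sketch of that aggregation, with a hypothetical class AggregateStateSketch and method worstState, and with plain ints standing in for the HandlerChecker objects:

```java
final class AggregateStateSketch {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Returns the most severe state among the checkers, or COMPLETED when there are none. */
    static int worstState(int... checkerStates) {
        int state = COMPLETED;
        for (int checkerState : checkerStates) {
            state = Math.max(state, checkerState);
        }
        return state;
    }

    public static void main(String[] args) {
        System.out.println(worstState(COMPLETED, COMPLETED, COMPLETED));   // 0
        System.out.println(worstState(COMPLETED, WAITING, WAITED_HALF));   // 2
        System.out.println(worstState(COMPLETED, OVERDUE, WAITING));       // 3
    }
}
```

A single overdue checker is therefore enough to make the whole evaluation OVERDUE, no matter how many other threads are healthy.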

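III. Creating the Watchdog in SystemServer#startOtherServices

When it starts, the SystemServer process only creates the Watchdog object through the singleton accessor and initializes it. Any other service or thread that needs to be monitored obtains the instance through com.android.server.Watchdog#getInstance and calls the appropriate method to register itself in Watchdog's monitoring list.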

private void startOtherServices() {
        final Context context = mSystemContext; 
        ...
		  final Watchdog watchdog = Watchdog.getInstance();
        watchdog.init(context, mActivityManagerService);
        ...

The Watchdog#init method

public void init(Context context, ActivityManagerService activity) {
        mResolver = context.getContentResolver();
        mActivity = activity;

        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);
    }