Hang Detect 问题

原创

wx638ef1cfb1448 2022-12-07 01:25:58 博主文章分类：linux driver ©著作权

©著作权归作者所有：来自51CTO博客作者wx638ef1cfb1448的原创作品，请联系作者获取转载授权，否则将追究法律责任

一、设计背景

2. 厂家难以分析

死机问题通常是概率性问题, 在售后难以分析, 而厂家拿回机器后, 往往难以复现, 导致问题往往不了了之，或者经过很长的时间才有结论.
死机问题往往需要比较多的系统关键资讯, 默认出厂版本, 这些资讯都不会抓取, 导致分析难度很大, 非常不利于厂家场测, CTA 测试, CMCC 测试等场景.

3. 传统死机保护设计缺陷

在Google Android 系统中, 死机保护是通过watchdog 机制来达成, 即将死机转换成重启.

HW Watchdog, 用于监测CPU 执行是否异常, 启用Kernel RT thread tick HW watchdog 来达成, 如果异常, 则重启整个系统.
System Server Watchdog, 用于监测Android System Server 关键线程和资源使用是否正常, 如果异常则重启android 上层.

严重缺陷.

HW Watchdog 和 System Server Watchdog 分离执行
System Server Watchdog 依赖于底层本身稳定性, 一旦Kernel 出现异常或者Native 层异常, System Server Watchdog 本身可能卡在
System Server Watchdog 后, android 重启, System Server 可能重启fail 掉, 导致一直卡死.

……

下列场景传统设计无法达成保护：

Android 虚拟机(ART/Dalvik) 异常, 导致Java 层代码无法执行, System Server 卡死.
File System 异常, 导致System Server Watchdog 卡在, 无法重启.
Memory leaks 或者Severely fragmented, 导致System Server 无法申请到memory 而卡在Kernel 中等待memory.
Kernel Driver 异常, 比如常见的Sensor Driver 卡在, 导致System Server 卡在后变成uninterruptible, system server 无法重启成功.
Surfaceflinger 卡在, system server 重启后, 将无法重启成功.
System Server Watchdog 抓取资讯时，如果自己卡住，那么将无法重启, 持续卡死.

……

二、设计原理

1. 设计原理

基于前面的说明，我们的设计关键在于将上层的system server watchdog 以及 HW watchdog 结合起来，同时进一步对上层的关键进程监控.

2. System Server Watchdog

SystemServer Watchdog 我们简称 SSW，最开始设计只是在main looper 里面执行register 的monitor 对象的monitor 方法. 后续Google 改进，通过HandlerChecker 来达成，在HandlerChecker 中审查注入的montior 对象是否能快速执行。

即在foreground thread 来执行register 的monitor 对象，而对于其他的线程，则是审查规定时间内是否可以达到idle. 不是一直卡死在某个message 执行上.

需要注意的是, SystemServer Watchdog 启动是在SystemServer init 的后期, 如果SystemServer 在init 的过程中卡死了，那么就意味着watchdog 不会有任何的作用.

226    /**
227     * The main entry point from zygote.
228     */
229    public static void main(String[] args) {
230        new SystemServer().run();
231    }
233    public SystemServer() {
234        // Check for factory test mode.
235        mFactoryTestMode = FactoryTest.getMode();
236    }
238    private void run() {
239        // If a device's clock is before 1970 (before 0), a lot of
240        // APIs crash dealing with negative numbers, notably
241        // java.io.File#setLastModified, so instead we fake it and
242        // hope that time from cell towers or NTP fixes it shortly.
243        if (System.currentTimeMillis() < EARLIEST_SUPPORTED_TIME) {
244            Slog.w(TAG, "System clock is before 1970; setting to 1970.");
245            SystemClock.setCurrentTimeMillis(EARLIEST_SUPPORTED_TIME);
        .......
         
318        // Initialize the system context.
319        createSystemContext();
320
321        // Create the system service manager.
322        mSystemServiceManager = new SystemServiceManager(mSystemContext);
323        LocalServices.addService(SystemServiceManager.class, mSystemServiceManager);
324
325        // Start services.
326        try {
327            startBootstrapServices();
328            startCoreServices();
329            startOtherServices();
330        } catch (Throwable ex) {
331            Slog.e("System", "******************************************");
332            Slog.e("System", "************ Failure starting system services", ex);
333            /// M: RecoveryManagerService  @{
334            if (mRecoveryManagerService != null && ex instanceof RuntimeException) {
335                mRecoveryManagerService.handleException((RuntimeException)ex, true);
336            }
337            /// @}
338        }
517    private void startOtherServices() {
518        final Context context = mSystemContext;
        .....
614            mSystemServiceManager.startService(AlarmManagerService.class);
615            alarm = IAlarmManager.Stub.asInterface(
616                    ServiceManager.getService(Context.ALARM_SERVICE));
617
618            Slog.i(TAG, "Init Watchdog");
619            final Watchdog watchdog = Watchdog.getInstance();
620            watchdog.init(context, mActivityManagerService);
        .....
         
1386                Watchdog.getInstance().start();
        .....
       }

3. surfaceflinger watchdog

Surfaceflinger Watchdog 为MTK 开发，简称其设计主要为快速检测到Surfaceflinger 是否有卡住的情况，并抓取backtrace 以便确认问题。目前其轮转周期非常快 1000ms.

117status_t SFWatchDog::readyToRun() {
118    XLOGI("[%s]", __func__);
119
120    mThreshold = SW_WATCHDOG_INIT_THRESHOLD; //800ms
121    mTimer = SW_WATCHDOG_TIMER; //1000ms
122    mShowLog = false;
123
124    return NO_ERROR;
125}

它提供了两个API 供外部调用, 如下所示.

352void SFWatchDog::markStartTransactionTime(uint32_t index) {
353    Mutex::Autolock _l(mLock);
354
355    if (index >= mNodeList.size()) {
356        XLOGE("[unmarkStartTransactionTime] index=%d > Node list size=%d", index, mNodeList.size());
357        return;
358    }
359
360    mNodeList[index]->mStartTransactionTime = systemTime();
361    mUpdateCount ++;
362    if (mShowLog)
363        XLOGV("[%s] name=%s, index=%d, time = %" PRId64 "", __func__, mNodeList[index]->mName.string(), index, mNodeList[index]->mStartTransactionTime);
364 }
365
366 void SFWatchDog::unmarkStartTransactionTime(uint32_t index) {
367    Mutex::Autolock _l(mLock);
368
369    if (index >= mNodeList.size()) {
370        XLOGE("[unmarkStartTransactionTime] index=%d > Node list size=%d", index, mNodeList.size());
371        return;
372    }
373
374    mNodeList[index]->mStartTransactionTime = 0;
375    mTransformRegionStorage.clear();
376
377    if (mShowLog)
378        XLOGV("[%s] name=%s, index=%d, time = %" PRId64 "", __func__, mNodeList[index]->mName.string(), index, mNodeList[index]->mStartTransactionTime);
379 }
380

目前检测的是SF Main thread 的行为, 即检测了SF 的onMessageReceived

937void SurfaceFlinger::onMessageReceived(int32_t what) {
938#ifdef MTK_AOSP_ENHANCEMENT
939    ATRACE_CALL_PERF();
940    // start watchdog
941    sp watchDog(SFWatchDog::getInstance());
942    watchDog->markStartTransactionTime(mWatchDogIndex);
943#else
944    ATRACE_CALL();
945#endif
946    switch (what) {
947        case MessageQueue::TRANSACTION: {
948            handleMessageTransaction();
949            break;
950        }
951        case MessageQueue::INVALIDATE: {
952            bool refreshNeeded = handleMessageTransaction();
953            refreshNeeded |= handleMessageInvalidate();
954            refreshNeeded |= mRepaintEverything;
955            if (refreshNeeded) {
956                // Signal a refresh if a transaction modified the window state,
957                // a new buffer was latched, or if HWC has requested a full
958                // repaint
959                signalRefresh();
960            }
961            break;
962        }
963        case MessageQueue::REFRESH: {
964            handleMessageRefresh();
965            break;
966        }
967    }
968#ifdef MTK_AOSP_ENHANCEMENT
969    // stop watchdog
970    watchDog->unmarkStartTransactionTime(mWatchDogIndex);
971#endif
972}

/data/anr/SF_RTT 当中.

4. hang detect

目前有开发两个feature：

一个用来监控system server 的watchdog 行为是否异常.
一个用来监控点亮屏幕的操作是否能在规定的时间内完成，因为此feature 存在一些误报的现象，目前还没有开启.

我们目前只说明第一个的实现.

首先，注册了一个device, /dev/RT_Monitor, 上层可以通过ioctl 和它进行沟通.

99 static long monitor_hang_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
100{
101 int ret = 0;
102 static long long monitor_status;
103 if (cmd == AEEIOCTL_WDT_KICK_POWERKEY) {
104     if ((int)arg == WDT_SETBY_WMS_DISABLE_PWK_MONITOR) {
105         /* pwk_start_monitor=0; */
106         /* wdt_kick_status=0; */
107         /* hwt_kick_times=0; */
108     } else if ((int)arg == WDT_SETBY_WMS_ENABLE_PWK_MONITOR) {
109         /* pwk_start_monitor=1; */
110         /* wdt_kick_status=0; */
111         /* hwt_kick_times=0; */
112     } else if ((int)arg < 0xf) {
113         aee_kernel_wdt_kick_Powkey_api("Powerkey ioctl", (int)arg);
114     }
115     return ret;
116
117 }
118/* QHQ RT Monitor */
119 if (cmd == AEEIOCTL_RT_MON_Kick) {
120     LOGE("AEEIOCTL_RT_MON_Kick ( %d)\n", (int)arg);
121     aee_kernel_RT_Monitor_api((int)arg);
122     return ret;
123 }
124//   LOGE("AEEIOCTL_RT_MON_Kick unknown cmd :(%d)( %d)\n",(int)cmd, (int)arg);
125//   LOGE("AEEIOCTL_RT_MON_Kick known cmd :(%d)( %d)\n",(int)AEEIOCTL_WDT_KICK_POWERKEY, (int)AEEIOCTL_RT_MON_Kick);
126/* QHQ RT Monitor end */
127
128 if (cmd == AEEIOCTL_SET_SF_STATE) {
129     if (copy_from_user(&monitor_status, (void __user *)arg, sizeof(long long)))
130         ret = -1;
131     LOGE("AEE_MONITOR_SET[status]: 0x%llx", monitor_status);
132     return ret;
133 } else if (cmd == AEEIOCTL_GET_SF_STATE) {
134     if (copy_to_user((void __user *)arg, &monitor_status, sizeof(long long)))
135         ret = -1;
136     return ret;
137 }
138
139 return -1;
140}

其中我们重点关注AEEIOCTL_RT_MON_Kick

445void aee_kernel_RT_Monitor_api(int lParam)
446{
447 if (0 == lParam) {
448     hd_detect_enabled = 0;
449     hang_detect_counter = hd_timeout;
450     LOGE("[Hang_Detect] hang_detect disabled\n");
451 } else if (lParam > 0) {
452     hd_detect_enabled = 1;
453     hang_detect_counter = hd_timeout = ((long)lParam + HD_INTER - 1) / (HD_INTER);
454     LOGE("[Hang_Detect] hang_detect enabled %d\n", hd_timeout);
455 }
456}

即在hang detect 中有两个关键的变数， hd_detect_enabled 指示当前hang detect 是否开启， hang_detect_counter 为计时器计数. hang_detect thread 每隔30s (HD_INTER) 将其减一。

其设计思想非常简单，即上层通过ioctl 此device, 设置hang_detect_counter, 即告知hang detect, 本人预计会在hang_detect_counter * 30s 之内再次来tick 您, 假如我没有在这个时间内tick 您，那么就意味着我已经牺牲了, 您就发动暴动, 重启手机.

对应hang_detect_thread 的实现就这个循环检测和操作hang_detect_counter .

385static int hang_detect_thread(void *arg)
386{
387
388 /* unsigned long flags; */
389 struct sched_param param = {.sched_priority = RTPM_PRIO_WDT };
390
391 LOGE("[Hang_Detect] hang_detect thread starts.\n");
392
393 sched_setscheduler(current, SCHED_FIFO, ¶m);
394
395 while (1) {
396     if ((1 == hd_detect_enabled) && (FindTaskByName("system_server") != -1)) {
397         LOGE("[Hang_Detect] hang_detect thread counts down %d:%d.\n",
398              hang_detect_counter, hd_timeout);
399
400
401         if (hang_detect_counter <= 0) {
402             ShowStatus();
403         }
404
405
406         if (hang_detect_counter == 0)
407         {
408             LOGE("[Hang_Detect] we should triger   HWT ...\n");
409             if(aee_mode!=AEE_MODE_CUSTOMER_USER)
410             {
411                 aee_kernel_warning_api(__FILE__, __LINE__, DB_OPT_NE_JBT_TRACES|DB_OPT_DISPLAY_HANG_DUMP, "\nCRDISPATCH_KEY:SS Hang\n","we triger HWT ");
412                 msleep(30 * 1000);
413             }
414             else //only Customer user load  trigger HWT
415             {
416                 //aee_kernel_exception_api(__FILE__, __LINE__, DB_OPT_NE_JBT_TRACES|DB_OPT_DISPLAY_HANG_DUMP, "\nCRDISPATCH_KEY:SS Hang\n","we triger HWT ");
417                 //msleep(30 * 1000);
418                 //local_irq_disable();
419                 //while (1);
420                 BUG();
421             }
422         }
423
424         hang_detect_counter--;
425     } else {
426         /* incase of system_server restart, we give 2 mins more.(4*HD_INTER) */
427         if (1 == hd_detect_enabled) {
428             hang_detect_counter = hd_timeout + 4;
429             hd_detect_enabled = 0;
430         }
431         LOGE("[Hang_Detect] hang_detect disabled.\n");
432     }
433
434     msleep((HD_INTER) * 1000);
435 }
436 return 0;
437}

5. System server Watchdog call hang detect

与之相关的代码其实很少，所以我直接贴出来，然后再来讲.

85  ExceptionLog exceptionHWT;
 
//init exceptionHWT
262        if (SystemProperties.get("ro.have_aee_feature").equals("1")) {
263            exceptionHWT = new ExceptionLog();
264        }
 
386         mSFHang = false;
387         if(exceptionHWT!= null){
388             exceptionHWT.WDTMatterJava(300);
389         }
390         if(mNeedDump){
391                // We've waited half the deadlock-detection interval.  Pull a stack
392                // trace and wait another half.
393                if(exceptionHWT!= null) exceptionHWT.WDTMatterJava(600);
394                dumpAllBackTraces(true);
395                mNeedDump = false;
396         }
397
 
495        if(exceptionHWT!= null) exceptionHWT.WDTMatterJava(720);

即System server watchdog 在轮转的过程中，会周期性的tick hang detect, 目前设计的timing 比较保守，我们会在M 版本将这个时间进一步降低优化. tick 分成三种:

正常情况下，tick 300s, 对应count=10.
在dump backtrace 时，tick 600s, 对应count=20.
在SWT 发生的情况下，tick 720s, 对应count=24.

三、问题分析

正如我们前面提到，目前hang detect 问题发生时，即意味着watchdog 因为一些特殊的原因没有正确的执行. 你的分析手法可以按下面的流程进行.

1. 确定System Server watchdog tick hang detect 的状态

这个可以从kernel log 中明确的看到.如：

watchdog 看起来正常：

[  198.215932] (1)[1322:watchdog]AEEIOCTL_RT_MON_Kick ( 300)
[  198.215945] (1)[1322:watchdog][Hang_Detect] hang_detect enabled 10

watchdog 在dump backtrace:

[  258.218145] (0)[1322:watchdog]AEEIOCTL_RT_MON_Kick ( 600)
[  258.218171] (0)[1322:watchdog][Hang_Detect] hang_detect enabled 20

watchdog 在做SWT:

[  299.046542] (0)[1322:watchdog]AEEIOCTL_RT_MON_Kick ( 720)
[  299.046572] (0)[1322:watchdog][Hang_Detect] hang_detect enabled 24

当然你也可以从hang detect thread 的log 中看到这个：

[  210.475572] (0)[90:hang_detect][Hang_Detect] init found pid:1.
[  210.475735] (0)[90:hang_detect][Hang_Detect] mmcqd/0 found pid:158.
[  210.475815] (0)[90:hang_detect][Hang_Detect] surfaceflinger found pid:265.
[  210.475887] (0)[90:hang_detect][Hang_Detect] system_server found pid:734.
[  210.475919] (0)[90:hang_detect][Hang_Detect] ndroid.systemui found pid:1071.
[  210.476003] (0)[90:hang_detect][Hang_Detect] debuggerd found pid:4313.
[  210.476027] (0)[90:hang_detect][Hang_Detect] debuggerd64 found pid:4314.
[  210.476056] (0)[90:hang_detect][Hang_Detect] hang_detect thread counts down 10:10.

2. 确认System Server Watchdog thread 状态

首先你需要明白， Watchdog 正常时，它是停留在java wait 函数，所以它对应的backtrace 即是

KERNEL SPACE BACKTRACE, sysTid=1299
 
    [] __switch_to+0x74/0x8c from [] 
    [] __schedule+0x314/0x794 from [] 
    [] schedule+0x24/0x68 from [] 
    [] futex_wait_queue_me+0xcc/0x158 from [] 
    [] futex_wait+0x120/0x20c from [] 
    [] do_futex+0x184/0xa48 from [] 
    [] SyS_futex+0x88/0x19c from [] 
    [] cpu_switch_to+0x48/0x4c from [] 
 
"watchdog" sysTid=1299
 
  #00 pc 0000000000019478  /system/lib64/libc.so (syscall+28)
  #01 pc 00000000000d0c74  /system/lib64/libart.so (art::ConditionVariable::TimedWait(art::Thread*, long, int)+168)
  #02 pc 00000000002a3e98  /system/lib64/libart.so (art::Monitor::Wait(art::Thread*, long, int, bool, art::ThreadState)+860)
  #03 pc 00000000002a4e04  /system/lib64/libart.so (art::Monitor::Wait(art::Thread*, art::mirror::Object*, long, int, bool, art::ThreadState)+244)
  #04 pc 0000000000000974  /data/dalvik-cache/arm64/system@framework@boot.oat

但注意的是因为在kernel 中，你只能看到kernel backtrace, 而SyS_futex 这个futex 只能表明是上层卡住, 可能是pthread mutex lock, 也可能是wait 也可能是其他，难以直接明了.

如果不是这个backtrace 那么很明显watchdog 已经卡住，这个时候即由它的backtrace 进行推理，以便确认进一步卡住的地方。

那如果你没有找到watchdog 线程怎么办呢？很简单，说明此时可能是两种情况，一种是system server 发生SWT 后，退出因为卡在了kernel 导致system server 某个线程无法退出，而这个线程就是你需要进一步分析的，另外一种情况是system server SWT 重启后，system server 初始化失败，watchdog 还没有启动就卡死了。这两种根据前后的LOG 很容易就可以判断出来。

3.关键process 状态追查

根据以往分析的经验，我们发现很多时候都是一些关键的process 卡住，导致watchdog 的行为无法达成. 所以我们在hang detect 中打印了关键的Process 的资讯，以便快速追查问题.

[  300.505542] (0)[90:hang_detect][Hang_Detect] init found pid:1.
[  300.505616] (0)[90:hang_detect][Hang_Detect] mmcqd/0 found pid:158.
[  300.505649] (0)[90:hang_detect][Hang_Detect] surfaceflinger found pid:265.
[  300.505678] (0)[90:hang_detect][Hang_Detect] system_server found pid:734.
[  300.505688] (0)[90:hang_detect][Hang_Detect] ndroid.systemui found pid:1071.
[  300.505719] (0)[90:hang_detect][Hang_Detect] debuggerd found pid:4313.
[  300.505726] (0)[90:hang_detect][Hang_Detect] debuggerd64 found pid:4314.
[  300.505735] (0)[90:hang_detect][Hang_Detect] hang_detect thread counts down 0:24.

比如我们会将这些关键的process 的kernel thread 打印，还会打印系统的memory 状况，以及系统中所有的 ‘D’ 的process 状态以便快速分析问题。

init 卡住的话，就意味着任何system property 的设置，或者service 的重启，wifi 的设置等都会卡住.

mmcqd 卡住的话，就意味着emmc 可能卡住.

surfaceflinger 卡住的话，就意味着display driver 和 GPU 可能卡住.

systemui 卡住的话，就意味着keyguard 可能卡住.

debuggerd 卡住的话，Watchdog 抓取backtrace 时就会卡住.

......

4. 保存现场快速分析.

当前面的资讯都难以最终定位root cause 时，我们就需要转变思想，将重启再变成死机, 即我们把hang_detect_thread 中的重启(call bug) 去除，或者直接使用eng 版本复现，目标就是让其死机，然后再用死机现场来分析。

MTK 提供team viewer 远程debug 支持，如果贵司有死机现场，可以让MTK 同仁通过team viewer 远程debug 以便快速定位问题.

当我们找到root cause 之后，再回过头来想，hang detect 缺少了哪些资讯导致我们无法分析成功，然后再修改hang detect.

这是一个持续渐进的过程, 欢迎大家提供更多有效的建议.