What is a watchdog

A Watchdog Timer (WDT) is a hardware circuit that can reset the computer system in case of a software fault. You probably knew that already. Usually a userspace daemon will notify the kernel watchdog driver via the /dev/watchdog special device file that userspace is still alive, at regular intervals. When such a notification occurs, the driver will usually tell the hardware watchdog that everything is in order, and that the watchdog should wait for yet another little while to reset the system. If userspace fails (RAM error, kernel bug, whatever), the notifications cease to occur, and the hardware watchdog will reset the system (causing a reboot) after the timeout occurs.

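A watchdog is a way of monitoring a system's health, implemented by combining hardware and software. Software that is running normally feeds the watchdog after executing specific instructions. If the watchdog receives no feed signal from the software within a given period, it treats the system as faulty and either enters an interrupt handler or forces a system reset.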

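Under the Linux kernel, the watchdog works as follows: once the watchdog has been started (that is, once the /dev/watchdog device has been opened), if /dev/watchdog is not written to within a configured interval, the hardware watchdog circuit or the software timer restarts the system. /dev/watchdog is a character device node with major number 10 and minor number 130. The Linux kernel provides drivers for many kinds of hardware watchdog circuits, and also a purely software watchdog driver based on a timer.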
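The open-then-write contract described above can be sketched as a small shell function. This is a minimal sketch, not a replacement for the watchdog daemon: the name feed_watchdog and its three arguments are made up for illustration. The final 'V' is the kernel watchdog API's "magic close" character, which asks the driver to disarm the timer when the file is closed; whether it is honored depends on the driver and on its nowayout setting.

```shell
# feed_watchdog DEVICE INTERVAL COUNT  (hypothetical helper, POSIX sh)
# Opens DEVICE once, writes to it COUNT times with INTERVAL seconds
# between writes, then writes 'V' before closing.
feed_watchdog() {
    wd_dev=$1
    wd_interval=$2
    wd_count=$3
    wd_i=0
    {
        while [ "$wd_i" -lt "$wd_count" ]; do
            printf '.' >&3          # any write counts as one feed
            sleep "$wd_interval"
            wd_i=$((wd_i + 1))
        done
        printf 'V' >&3              # magic close: request a clean stop
    } 3>"$wd_dev"                   # fd 3 stays open for the whole loop
}
```

On a guest, `feed_watchdog /dev/watchdog 10 6` would keep the device fed for about a minute and then attempt a clean stop. Pointing the function at a regular file is a harmless way to try it out.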

For a watchdog to take effect, both hardware and software support are required.

The feature relies on two components:
• A hardware component that sets a 30-second timer (your value might be different). When the timer expires, the component triggers a system restart. On bare-metal machines, a chipset provides the feature.
• A software component, run by the operating system, that regularly resets the hardware timer to prevent it from expiring. When the operating system hangs, the software component also hangs and cannot refresh the timer. The timer expires and the system restarts.

Watchdogs only monitor operating systems and do not detect application failures. For example, the watchdog does not trigger when an application hangs but the operating system is still responding.

The VM's watchdog device (hardware):

x86_64 VMs currently support two watchdog models, i6300esb and ib700:

<devices>
  <watchdog model='i6300esb/ib700' action='reset/shutdown/poweroff/pause/none/dump/inject-nmi'/>
</devices>

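action specifies what happens to the guest once the watchdog fires: the guest can be shut down or powered off, reset, paused, or dumped, an NMI can be injected (inject-nmi), or nothing is done.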
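The slash-separated values above are alternatives, so that line is a template rather than valid XML. As a hedged sketch, the hypothetical helper below (watchdog_xml is not part of libvirt or virsh) emits one concrete element after checking the model and action against the values listed in this section.

```shell
# watchdog_xml MODEL ACTION  (hypothetical helper, POSIX sh)
# Prints a <watchdog> element, or fails if a value is not one of
# those listed in this document.
watchdog_xml() {
    case $1 in
        i6300esb|ib700) ;;
        *) echo "unsupported model: $1" >&2; return 1 ;;
    esac
    case $2 in
        reset|shutdown|poweroff|pause|none|dump|inject-nmi) ;;
        *) echo "unsupported action: $2" >&2; return 1 ;;
    esac
    printf "<watchdog model='%s' action='%s'/>\n" "$1" "$2"
}
```

For example, `watchdog_xml i6300esb reset` prints `<watchdog model='i6300esb' action='reset'/>`, which can be placed inside the domain's `<devices>` element.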

Linux software support for the VM's watchdog (software):

1. watchdog device node

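Once Linux has booted there is a watchdog character device, called the watchdog device node; in this setup it was present whether or not a hardware watchdog exists. If the node is missing, it can be created with mknod (the node is only usable once a watchdog driver is loaded):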

# ll /dev/watchdog
crw-------. 1 root root 10, 130 Jan  6 17:36 /dev/watchdog
## You can recreate it if it does not exist.
# mknod /dev/watchdog c 10 130

2. Driver for the watchdog hardware device

# lspci | grep -i 6300
08:01.0 System peripheral: Intel Corporation 6300ESB Watchdog Timer
# lspci -vvv -s 08:01.0 | grep "Kernel modules"
	Kernel modules: i6300esb
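Besides lspci, a bound driver can also be inspected through sysfs. The sketch below assumes the kernel was built with watchdog sysfs support (CONFIG_WATCHDOG_SYSFS), so that /sys/class/watchdog/watchdog0 exposes attributes such as identity, timeout and state. wd_info is an illustrative name, and attributes that are missing are simply skipped.

```shell
# wd_info [SYSFS_DIR]  (hypothetical helper, POSIX sh)
# Prints attribute=value for each readable watchdog attribute.
wd_info() {
    wd_dir=${1:-/sys/class/watchdog/watchdog0}
    for wd_attr in identity timeout state; do
        if [ -r "$wd_dir/$wd_attr" ]; then
            printf '%s=%s\n' "$wd_attr" "$(cat "$wd_dir/$wd_attr")"
        fi
    done
}
```

Running `wd_info` with no argument in the guest reads the first watchdog device; an empty result suggests that either no driver is bound or the sysfs attributes are not enabled in that kernel.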

How to trigger the guest's watchdog action

Taking the device below as an example, there are two ways to trigger the watchdog action and so verify that the hardware works.

<devices>
  <watchdog model='i6300esb' action='reset'/>
</devices>

Trigger method 1:

Use another process, such as cat or echo, to write to the /dev/watchdog device.

If an application opens this device file, it becomes responsible for the watchdog and can reset it by writing to the file. Normally the watchdog daemon keeps writing to /dev/watchdog periodically; this is also called "kicking" or "feeding" the watchdog. If nothing kicks or feeds the watchdog, the hardware watchdog hard-resets the system after a while.

Run in the guest:

# echo 1 > /dev/watchdog
[   21.608056] watchdog: watchdog0: watchdog did not stop!

# cat >> /dev/watchdog

The guest reboots automatically after 30 s. No software has to be installed and no service has to be started.

Trigger method 2:

If /dev/watchdog is not written to within a configured interval, the watchdog is triggered.

If no application opens the /dev/watchdog file, the kernel takes care of resetting the watchdog (when the watchdog daemon is active and running). The watchdog module is a timer; it does not appear as a dedicated kernel thread but is handled by the soft IRQ thread.

  1. First install the watchdog package in the guest:
# yum install watchdog
  2. Configure /etc/watchdog.conf:
# cat /etc/watchdog.conf
# Interval between tests. Should be a couple of seconds shorter than
# the hardware time-out value.
# For this test it is deliberately set larger than 60.
interval = 61

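interval is the time between two feeds by the watchdog daemon. It should be smaller than the hardware time-out; otherwise the feed arrives too late and the watchdog action is triggered. Because the default hardware time-out is 60 s, setting interval to 61 s triggers the watchdog.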

If the device is opened but not written to within a minute, the machine will reboot.

  3. Start the watchdog process in the guest and wait 61 s:
# systemctl start watchdog
Job for watchdog.service failed because the control process exited with error code.
See "systemctl status watchdog.service" and "journalctl -xeu watchdog.service" for details.
# watchdog
watchdog: This interval length (61) might reboot the system while the process sleeps! Try 59 or less
watchdog: To force parameter(s) use the --force command line option.
# watchdog -f

The guest reboots automatically once the 61 s have elapsed.
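The interval rule from step 2 can be checked before starting the daemon. This is a rough sketch under stated assumptions: check_interval is a hypothetical helper, it only understands plain `interval = N` lines, it takes the last such line if there are several, and the hardware time-out is passed in by hand rather than read from the device.

```shell
# check_interval CONF TIMEOUT  (hypothetical helper, POSIX sh)
# Returns 0 if the configured interval is below TIMEOUT, 1 if the
# watchdog would fire, 2 if no interval line was found.
check_interval() {
    ci_conf=$1
    ci_timeout=$2
    # Commented-out lines do not match; trailing text after N is ignored.
    ci_interval=$(sed -n 's/^[[:space:]]*interval[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$ci_conf" | tail -n 1)
    if [ -z "$ci_interval" ]; then
        echo "no interval set in $ci_conf"
        return 2
    fi
    if [ "$ci_interval" -ge "$ci_timeout" ]; then
        echo "will trigger: interval $ci_interval >= timeout $ci_timeout"
        return 1
    fi
    echo "ok: interval $ci_interval < timeout $ci_timeout"
}
```

With the configuration used in this test, `check_interval /etc/watchdog.conf 60` reports that the watchdog will trigger, which is the intended outcome here; on a production guest the same result would point to a misconfiguration.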

Refer to:
https://libvirt.org/formatdomain.html#watchdog-device
http://junyelee.blogspot.com/2021/07/linux-watchdog.html#:~:text=The%20watchdog%20is%20automatically%20started,its%20configuration%20file%20in%20turn.
https://mjmwired.net/kernel/Documentation/watchdog/watchdog-api.txt