Kubernetes checks the health of the containers in a Pod with two kinds of probes: LivenessProbe (liveness probe) and ReadinessProbe (readiness probe).

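  1. LivenessProbe (liveness probe)
    Determines whether a container is still healthy (in the Running state) and reports the result to the kubelet.
    Many applications gradually become unusable after running for a long time and can only be recovered by a restart. The Kubernetes liveness probe mechanism detects problems of this kind, and the follow-up action is decided by the probe result together with the Pod's restart policy.
    A liveness probe is configured at the container level; the kubelet uses it to decide when a container needs to be restarted.
    If a container does not define a LivenessProbe, the kubelet treats its liveness result as always Success.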
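  2. ReadinessProbe (readiness probe)
    Determines whether the service in a container is available (in the Ready state). Only Pods that have reached the Ready state receive requests.
    For Pods managed by a Service, the association between the Service and the Pod's Endpoint is also set according to whether the Pod is Ready.
    If the Ready state becomes False while the Pod is running, the system automatically removes the Pod from the Service's backend Endpoint list, and adds it back once it returns to the Ready state.
    This ensures that clients accessing the Service are never forwarded to a Pod instance whose service is unavailable.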
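
The readiness behaviour described above can be sketched with a manifest like the following. This is only an illustration, not taken from the original examples: the Pod name readiness-http and the label app: web are hypothetical, and it assumes the stock nginx image, which serves / on port 80.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: readiness-http        # hypothetical name
  labels:
    app: web                  # hypothetical label that a Service selector could match
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    readinessProbe:           # controls Ready state only; a failure does not restart the container
      httpGet:
        path: /               # assumed to return 200 on the stock nginx image
        port: 80
      initialDelaySeconds: 5  # wait 5s after start before the first check
      periodSeconds: 10       # check every 10s
```

While the probe fails, the Pod stays in Running but shows READY 0/1, and a Service selecting it would leave it out of its Endpoint list.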

Probe implementation methods

Both LivenessProbe and ReadinessProbe can be configured with any of the following three probe methods:

  1. ExecAction
    Runs a user-defined command inside the target container to determine its health. If the command exits with code 0, the container is considered healthy.

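In the example below, the command "cat /tmp/health" is run to judge whether the container is working properly.
After the Pod starts, the container creates /tmp/health and deletes it 10s later.
The liveness probe's initial delay (initialDelaySeconds) is 15s, so by the time the first probe runs the file is already gone and the probe fails; the repeated failures cause the kubelet to kill the container and restart it: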

vim liveness-exec.yaml

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
    - name: liveness
      image: busybox
      args: ["/bin/sh","-c","echo ok > /tmp/health; sleep 10; rm -rf /tmp/health; sleep 600"]
      livenessProbe:
        exec:
          command: ["cat","/tmp/health"]
        initialDelaySeconds: 15
        timeoutSeconds: 1

Create the pod:

[root@bogon ~]# kubectl create -f liveness-exec.yaml 
pod/liveness-exec created
[root@bogon ~]# kubectl get pods
NAME                           READY   STATUS      RESTARTS   AGE
dapi-test-pod                  0/1     Completed   0          8h
dapi-test-pod-container-vars   1/1     Running     0          7h8m
dapi-test-pod-volume           1/1     Running     0          4h49m
liveness-exec                  1/1     Running     0          7s

View the details:
kubectl describe pods/liveness-exec
The events show that the container is being restarted:

Events:
  Type     Reason     Age                        From               Message
  ----     ------     ----                       ----               -------
  Normal   Scheduled  <unknown>                  default-scheduler  Successfully assigned default/liveness-exec to server01
  Normal   Pulled     8s (x3 over 2m31s)         kubelet, server01  Successfully pulled image "busybox"
  Normal   Created    8s (x3 over 2m31s)         kubelet, server01  Created container liveness
  Normal   Started    8s (x3 over 2m31s)         kubelet, server01  Started container liveness
  Warning  Unhealthy  <invalid> (x9 over 2m11s)  kubelet, server01  Liveness probe failed: cat: can't open '/tmp/health': No such file or directory
  Normal   Killing    <invalid> (x3 over 112s)   kubelet, server01  Container liveness failed liveness probe, will be restarted
  Normal   Pulling    <invalid> (x4 over 2m36s)  kubelet, server01  Pulling image "busybox"
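
  2. TCPSocketAction
    Performs a TCP check against the container's IP address and port. If a TCP connection can be established, the container is considered healthy.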

In the example below, the health check is done by opening a TCP connection to port 80 of the container:
vim pod-with-healthcheck.yaml

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-healthcheck
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    livenessProbe:
      tcpSocket:
        port: 80
      initialDelaySeconds: 30
      timeoutSeconds: 1

Create the pod:

[root@bogon ~]# kubectl create -f pod-with-healthcheck.yaml
[root@bogon ~]# kubectl get pods
NAME                           READY   STATUS      RESTARTS   AGE
pod-with-healthcheck           1/1     Running     0          45s

View the details:
[root@bogon ~]# kubectl describe pod pod-with-healthcheck

Containers:
  nginx:
    Container ID:   docker://19f987e8146d926fd28e04d1fa9677d60cb80f20e42a36fc20fba346412113c5
    Image:          nginx
    Image ID:       docker-pullable://nginx@sha256:a93c8a0b0974c967aebe868a186e5c205f4d3bcb5423a56559f2f9599074bbcd
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sat, 27 Jun 2020 21:11:23 +0800
    Ready:          True
    Restart Count:  0
    Liveness:       tcp-socket :80 delay=30s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-btm7g (ro)
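
  3. HTTPGetAction
    Issues an HTTP GET request to the container's IP address, port, and path. If the response status code is greater than or equal to 200 and less than 400, the container is considered healthy.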

In the example below, the kubelet periodically sends an HTTP request to path /_status/healthz on port 80 of the container to check the health of the application:
cat pod-http-get-action.yaml

apiVersion: v1
kind: Pod
metadata:
  name: pod-http-get-action
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    livenessProbe:
      httpGet:
        path: /_status/healthz
        port: 80
      initialDelaySeconds: 30
      timeoutSeconds: 1

View the details:
kubectl describe pod pod-http-get-action

Events:
  Type     Reason     Age                            From               Message
  ----     ------     ----                           ----               -------
  Normal   Scheduled  <unknown>                      default-scheduler  Successfully assigned default/pod-http-get-action to server01
  Normal   Pulling    <invalid> (x2 over 30s)        kubelet, server01  Pulling image "nginx"
  Normal   Killing    <invalid>                      kubelet, server01  Container nginx failed liveness probe, will be restarted
  Normal   Pulled     <invalid> (x2 over 29s)        kubelet, server01  Successfully pulled image "nginx"
  Normal   Created    <invalid> (x2 over 29s)        kubelet, server01  Created container nginx
  Normal   Started    <invalid> (x2 over 28s)        kubelet, server01  Started container nginx
  Warning  Unhealthy  <invalid> (x4 over <invalid>)  kubelet, server01  Liveness probe failed: HTTP probe failed with statuscode: 404

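The events show that nginx answers /_status/healthz with 404, so the liveness probe keeps failing. The pod has already been restarted 3 times: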

[root@bogon ~]# kubectl get pods
NAME                           READY   STATUS      RESTARTS   AGE
pod-http-get-action            1/1     Running     3          3m28s
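
For contrast, here is a sketch of a variant whose probe should succeed. It is not part of the original walkthrough: the Pod name pod-http-get-ok is hypothetical, and it assumes the stock nginx image, which answers / with a 200 status.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-http-get-ok     # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    livenessProbe:
      httpGet:
        path: /             # assumed to exist on the stock nginx image, unlike /_status/healthz
        port: 80
      initialDelaySeconds: 30
      timeoutSeconds: 1
```

With a path that returns a status code in the 200 to 399 range, the RESTARTS column would be expected to stay at 0.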

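Every probe method accepts the initialDelaySeconds and timeoutSeconds parameters, which the examples above set explicitly. Their meanings are:

  • initialDelaySeconds: how long to wait after the container starts before the first health check, in seconds.
  • timeoutSeconds: how long to wait for a response after the health check request is sent, in seconds.
    A probe that times out counts as a failed probe. For a liveness probe, once the number of consecutive failures reaches failureThreshold (3 by default), the kubelet considers the container unable to serve and restarts it.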
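
The timing parameters can also be written out in full. The sketch below is illustrative only: the Pod name liveness-tuned is hypothetical, and the period, timeout, and threshold values simply restate the defaults visible in the describe output earlier (period=10s, timeout=1s, #success=1, #failure=3).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-tuned          # hypothetical name
spec:
  restartPolicy: Always         # default policy; a failed liveness probe leads to a restart
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    livenessProbe:
      tcpSocket:
        port: 80
      initialDelaySeconds: 30   # wait 30s after the container starts
      periodSeconds: 10         # probe every 10s
      timeoutSeconds: 1         # each probe may take at most 1s
      failureThreshold: 3       # restart after 3 consecutive failures
      successThreshold: 1       # must be 1 for liveness probes
```

With these values, a container that stops responding would be restarted roughly 30 seconds after the first failed probe, since three consecutive failures spaced 10s apart are needed.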