1 Overview

1.1 Code environment

Version information:
a. Kubernetes cluster: v1.15.4

1.2 Pod deletion process in brief

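When a user runs kubectl delete pod (which implicitly carries grace-period=30s), kubectl calls the DELETE endpoint of kube-apiserver. At this point the handler only updates the Pod object's metadata (the DeletionTimestamp and DeletionGracePeriodSeconds fields) and does not remove the record from etcd; the kubectl command blocks and reports that the pod is being deleted. When kubelet observes the update event for the Pod object, it runs the corresponding callback; because DeletionTimestamp is set, that logic calls killPod(). A short while later, kubelet handles the pod's deletion: its statusManager module calls the kube-apiserver DELETE endpoint again, this time with grace-period=0. On this second call the DELETE endpoint removes the pod object from etcd, and only then does kubectl get pod stop showing the pod, because the record is really gone.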
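The two-call flow described above can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: Pod, Store and Store.Delete are hypothetical names invented for the sketch and are not the kube-apiserver types; the only behavior modeled is "positive grace period marks the pod, grace period 0 removes the record".

```go
package main

import (
	"fmt"
	"time"
)

// Pod is a toy stand-in for the API object; only the fields discussed above are modeled.
type Pod struct {
	Name                       string
	DeletionTimestamp          *time.Time
	DeletionGracePeriodSeconds *int64
}

// Store is a hypothetical in-memory replacement for etcd.
type Store struct {
	pods map[string]*Pod
}

// Delete mimics the two behaviors of the DELETE endpoint described above:
// a positive grace period only marks the pod, a zero grace period removes
// the record. It reports whether the record was actually removed.
func (s *Store) Delete(name string, gracePeriodSeconds int64) (bool, error) {
	pod, ok := s.pods[name]
	if !ok {
		return false, fmt.Errorf("pod %q not found", name)
	}
	if gracePeriodSeconds > 0 {
		deadline := time.Now().Add(time.Duration(gracePeriodSeconds) * time.Second)
		pod.DeletionTimestamp = &deadline
		pod.DeletionGracePeriodSeconds = &gracePeriodSeconds
		return false, nil
	}
	delete(s.pods, name)
	return true, nil
}

func main() {
	s := &Store{pods: map[string]*Pod{"poda": {Name: "poda"}}}

	// First call, as issued by kubectl with the default 30s grace period.
	removed, _ := s.Delete("poda", 30)
	fmt.Println("removed after first call:", removed,
		"terminating:", s.pods["poda"].DeletionTimestamp != nil)

	// Second call, as issued by kubelet with grace period 0.
	removed, _ = s.Delete("poda", 0)
	_, stillStored := s.pods["poda"]
	fmt.Println("removed after second call:", removed, "still stored:", stillStored)
}
```

Between the two calls the pod is still stored, which corresponds to the window in which the pod remains visible to kubectl get pod.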



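The first time the kube-apiserver DELETE endpoint is triggered, execution enters the if statement (the !deleteImmediately check in Store.Delete) and the method returns right away. Debug screenshot:

[Figure 1: debug screenshot, first DELETE call]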


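The second time the kube-apiserver DELETE endpoint is triggered, execution skips the if statement and continues with the rest of the logic, which deletes the pod object from etcd. Debug screenshot:

[Figure 2: debug screenshot, second DELETE call]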



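2 Main source code analysis

2.1 The handler of the kube-apiserver DELETE endpoint

kube-apiserver is a web server that registers its HTTP handlers at startup. The DELETE endpoint is registered as follows: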

func (a *APIInstaller) registerResourceHandlers(path string, storage rest.Storage, ws *restful.WebService) (*metav1.APIResource, error) {

    switch action.Verb {

    case "DELETE": // Delete a resource.
        /*
        other code
        */
        // The handler's main logic lives in restfulDeleteResource().
        handler := metrics.InstrumentRouteFunc(action.Verb, group, version, resource, subresource, requestScope, metrics.APIServerComponent,
            restfulDeleteResource(gracefulDeleter, isGracefulDeleter, reqScope, admit))

        route := ws.DELETE(action.Path).To(handler).
            /*
            other code
            */
            Returns(http.StatusOK, "OK", versionedStatus)
        /*
        other code
        */
        addParams(route, action.Params)
        routes = append(routes, route)
    }
}

The business logic of the DELETE endpoint is a package-level function located in staging/src/k8s.io/apiserver/pkg/endpoints/handlers/delete.go:

func restfulDeleteResource(r rest.GracefulDeleter, allowsOptions bool, scope handlers.RequestScope, admit admission.Interface) restful.RouteFunction {
    return func(req *restful.Request, res *restful.Response) {
        // Call the package-level function from staging/src/k8s.io/apiserver/pkg/endpoints/handlers/delete.go.
        handlers.DeleteResource(r, allowsOptions, &scope, admit)(res.ResponseWriter, req.Request)
    }
}


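// When a user runs kubectl delete pod PODA, this function is triggered twice:
// the first time by the request from kubectl,
// the second time by the request from the statusManager module of kubelet.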
func DeleteResource(r rest.GracefulDeleter, allowsOptions bool, scope *RequestScope, admit admission.Interface) http.HandlerFunc {
    return func(w http.ResponseWriter, req *http.Request) {        
        trace := utiltrace.New("Delete " + req.URL.Path)    
        /*
        other code
        */
        options := &metav1.DeleteOptions{}
        trace.Step("About to delete object from database")

        result, err := finishRequest(timeout, func() (runtime.Object, error) {
            // The key call is r.Delete(...).
            obj, deleted, err := r.Delete(ctx, name, rest.AdmissionToValidateObjectDeleteFunc(admit, staticAdmissionAttrs, scope), options)
            /*
            other code
            */
            return obj, err
        })
        /*
        validation code
        */
        trace.Step("Object deleted from database")

        status := http.StatusOK
        /*
        other code
        */
        // Write the response back to the client.
        transformResponseObject(ctx, scope, trace, req, w, status, outputMediaType, result)
    }
}
// For pods, r.Delete(...) above ends up in the generic registry store
// (staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go).
func (e *Store) Delete(ctx context.Context, name string, deleteValidation rest.ValidateObjectFunc, options *metav1.DeleteOptions) (...) {
    key, err := e.KeyFunc(ctx, name)
    /*
    validation code and other minor code
    */
    if graceful || pendingFinalizers || shouldUpdateFinalizers {
        // Update the pod object's metadata.
        err, ignoreNotFound, deleteImmediately, out, lastExisting = e.updateForGracefulDeletionAndFinalizers(ctx, name, key, options, preconditions, deleteValidation, obj)
    }

    // The first call returns here.
    // !deleteImmediately covers all cases where err != nil. We keep both to be future-proof.
    if !deleteImmediately || err != nil {
        return out, false, err
    }

    // Only the second call gets this far.
    klog.V(6).Infof("going to delete %s from registry: ", name)
    // Delete the object from etcd.
    e.Storage.Delete(...)
}

func (e *Store) updateForGracefulDeletionAndFinalizers(...) (...){
    /*
    other code
    */
    graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, existing, options)
    /*
    other code
    */
}

func BeforeDelete(...) (...){
    // Update the target object's metadata: the DeletionTimestamp and DeletionGracePeriodSeconds fields.
    objectMeta.SetDeletionTimestamp(&now)
    objectMeta.SetDeletionGracePeriodSeconds(options.GracePeriodSeconds)
}



2.2 kubelet processing flow

2.2.1 kubelet observes the update event of the pod object

The main loop runs syncPod(), and syncPod() in turn calls kl.killPod(...).

func (kl *Kubelet) syncPod(o syncPodOptions) error {
    /*
    other code
    */
    // Enter the if statement when the pod object has a DeletionTimestamp.
    if !runnable.Admit || pod.DeletionTimestamp != nil || apiPodStatus.Phase == v1.PodFailed {
        // killPod(...) asks the container runtime to stop the pod's containers.
        if err := kl.killPod(pod, nil, podStatus, nil); err != nil {
            /*
            other code
            */
        } else {
            /*
            other code
            */
        }
        return syncErr
    }

}

func (kl *Kubelet) killPod(pod *v1.Pod, runningPod *kubecontainer.Pod, status *kubecontainer.PodStatus, gracePeriodOverride *int64) error {
	var p kubecontainer.Pod
	/*
	other code
	*/
	// Ask the container runtime to stop the pod's containers.
	if err := kl.containerRuntime.KillPod(pod, p, gracePeriodOverride); err != nil {
		return err
	}
	if err := kl.containerManager.UpdateQOSCgroups(); err != nil {
		klog.V(2).Infof("Failed to update QoS cgroups while killing pod: %v", err)
	}
	return nil
}



2.2.2 kubelet observes the deletion event of the pod object

The statusManager goroutine calls m.kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, deleteOptions), which makes kube-apiserver delete the pod object from etcd.

// kubelet has a statusManager module that calls syncPod() in a for loop.
// Inside the method it may call the kube-apiserver DELETE endpoint (forced, non-graceful deletion).
func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
    /*
    other code
    */
    // The if body runs only when several conditions hold: the pod has a DeletionTimestamp,
    // its containers have been removed, its persistent volumes have been removed, and so on.
    if m.canBeDeleted(pod, status.status) {
        deleteOptions := metav1.NewDeleteOptions(0)
        deleteOptions.Preconditions = metav1.NewUIDPreconditions(string(pod.UID))

        // Force-delete the pod object, equivalent to: kubectl delete pod podA --grace-period=0
        err = m.kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, deleteOptions)

        /*
        other code
        */
    }
}



3 Official English documentation: Termination of Pods

Because Pods represent running processes on nodes in the cluster, it is important to allow those processes to gracefully terminate when they are no longer needed (vs being violently killed with a KILL signal and having no chance to clean up).
Users should be able to request deletion and know when processes terminate, but also be able to ensure that deletes eventually complete.
# When a user sends a delete-pod request, the system records the grace period and sends a TERM signal to the main process of each container in the Pod.
When a user requests deletion of a Pod, the system records the intended grace period before the Pod is allowed to be forcefully killed, and a [ TERM signal ] is sent to the main process in each container. 
# Once the grace period expires, a KILL signal is sent to the main process of each container in the Pod, and the API server deletes the Pod object.
Once the grace period has expired, the [ KILL signal ] is sent to those processes, and the Pod is then deleted from the API server. 
If the Kubelet or the container manager is restarted while waiting for processes to terminate, the termination will be retried with the full grace period.

An example flow:
	1. User sends command to delete Pod, with default grace period (30s)
	2. The Pod in the API server is updated with the time beyond which the Pod is considered “dead” along with the grace period.
	3. Pod shows up as [ "Terminating" ] when listed in client commands
	4. (simultaneous with 3) When the [ Kubelet ] sees that a Pod has been marked as terminating because the time in 2 has been set, it begins the pod shutdown process.
		4.1. If the pod has defined a preStop hook, it is invoked inside of the pod. If the preStop hook is still running after the grace period expires, step 2 is then invoked with a small (2 second) extended grace period.
		4.2. The processes in the Pod are sent the [ TERM signal ].
	5. (simultaneous with 3) Pod is removed from endpoints list for service, and are no longer considered part of the set of running pods for replication controllers. Pods that shutdown slowly cannot continue to serve traffic as load balancers (like the service proxy) remove them from their rotations.
	6. When the [ grace period expires ], any processes [ still running ] in the Pod are killed with [ SIGKILL ].
	7. The Kubelet will finish deleting the Pod on the API server by setting grace period 0 (immediate deletion). The Pod disappears from the API and is no longer visible from the client.

By default, all deletes are graceful within 30 seconds. The kubectl delete command supports the --grace-period=<seconds> option which allows a user to override the default and specify their own value. 
The value 0 force deletes the pod. In kubectl version >= 1.5, you must specify an additional flag --force along with --grace-period=0 in order to perform force deletions.