Kubernetes CRI: Past and Present

Reposted

hagretd 2020-03-11 15:44:39 Blog category: k8s

Tags: k8s, cri    Category: Operations

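While learning Kubernetes we run into terms such as CRI, CNI, CSI and OCI. This article first analyzes one of the container runtime architectures that k8s uses by default, to help explain the design logic behind the k8s runtime, and then describes the background that led to CRI and OCI.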

I. k8s architecture

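When building a k8s cluster we first set up the master node, then create worker nodes and join them to the cluster. Once the cluster is up, we can create the Pods for an application with a command such as kubectl create -f nginx.yml. The command is submitted to the API server, which parses the YAML file and stores it in etcd as an API object. For workload objects such as a Deployment, the Controller Manager on the master then runs its control loops to do the orchestration work and creates the Pods the application needs. The Scheduler watches (through the API server) for new Pods. When it finds a new Pod, it runs its scheduling algorithm to select the most suitable Node and writes that Node's name into the Pod object's nodeName field. This step is the so-called "Bind Pod to Node" (marked in the figure below); the binding result is then written back to etcd.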
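To make the "Bind Pod to Node" step concrete, here is a toy sketch of the filter-then-score idea. It is not kube-scheduler's code: the `Node`/`Pod` classes, the CPU-only fit check and the "most free CPU wins" scoring rule are all hypothetical simplifications. The only part taken from the text is the outcome, a chosen node name written into the Pod's nodeName field.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpu_millis: int  # hypothetical: only CPU is considered in this sketch


@dataclass
class Pod:
    name: str
    cpu_request_millis: int
    node_name: Optional[str] = None  # mirrors spec.nodeName; empty until bound


def schedule(pod: Pod, nodes: List[Node]) -> Optional[str]:
    # Filter: drop nodes that cannot fit the pod's request.
    feasible = [n for n in nodes if n.free_cpu_millis >= pod.cpu_request_millis]
    if not feasible:
        return None  # no fit: the pod stays Pending
    # Score: prefer the node with the most free CPU (a hypothetical policy).
    best = max(feasible, key=lambda n: n.free_cpu_millis)
    return best.name


def bind(pod: Pod, node_name: str) -> None:
    # "Bind Pod to Node": record the chosen node on the pod object.
    # In a real cluster this goes through the API server, which persists it to etcd.
    pod.node_name = node_name


nodes = [Node("node-a", 500), Node("node-b", 2000), Node("node-c", 1500)]
nginx = Pod("nginx", cpu_request_millis=1000)
target = schedule(nginx, nodes)
if target is not None:
    bind(nginx, target)
print(nginx.node_name)  # node-b
```

The real scheduler runs many filter and score plugins, but the shape is the same: shrink the candidate set, rank what is left, write one node name back.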

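Second, every node in the cluster runs a kubelet process. The kubelet watches (again through the API server) for Pod changes. When it sees a Pod binding update whose target is its own node, it takes over the remaining work, such as pulling images and creating containers.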
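The kubelet's "is this Pod bound to me?" check can be sketched as a simple filter over Pod update events. The event format below is a hypothetical simplified dict. The real kubelet asks the API server to do this filtering with a `spec.nodeName` field selector instead of scanning every event itself.

```python
from typing import Dict, Iterable, List


def pods_for_node(events: Iterable[Dict[str, str]], node_name: str) -> List[str]:
    """Return the pod names this node's kubelet should act on, in event order.

    Each event is a hypothetical simplified dict: {"pod": ..., "nodeName": ...}.
    Events with an empty nodeName (not yet scheduled) or another node's name
    are ignored.
    """
    mine = []
    for ev in events:
        if ev.get("nodeName") == node_name:
            # Next steps in a real kubelet: pull images, create containers, ...
            mine.append(ev["pod"])
    return mine


events = [
    {"pod": "nginx", "nodeName": ""},        # created, not scheduled yet
    {"pod": "nginx", "nodeName": "node-b"},  # bind result written back
    {"pod": "redis", "nodeName": "node-a"},
]
print(pods_for_node(events, "node-b"))  # ['nginx']
```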

[Figure: k8s architecture]

II. The default k8s container runtime architecture

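Next, using the container runtime architecture that k8s integrates by default, let's look at how the kubelet creates a container (see the figure below).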

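1. The kubelet calls dockershim through the CRI (Container Runtime Interface), a gRPC interface, asking it to create a container. In this step the kubelet acts as a simple CRI client, and dockershim is the server that receives the request.

2. On receiving the request, dockershim adapts it into the Docker Daemon's request format and sends it to the Docker Daemon to create a container. Starting with Docker 1.11, the Docker daemon was split into dockerd and containerd, with containerd responsible for operating on containers.

3. When dockerd receives the request, it calls the containerd process to create the container.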

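4. containerd does not operate on the container directly. Instead, it creates a process called containerd-shim and lets containerd-shim manage the container. The main reasons for having containerd-shim are:

1) containerd-shim handles tasks such as collecting status and keeping file descriptors like stdin open.

2) The container runtime (runC) can exit after starting the container, so there is no need to keep a runC process running for every container.

3) Even if both containerd and dockerd crash, the container's standard I/O and other file descriptors remain available.

4) It reports the container's exit status to containerd.

5) dockerd can be upgraded or restarted without interrupting running containers.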

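5. containerd-shim then invokes the command-line tool runC to start the container. runC is a reference implementation of the OCI (Open Container Initiative) runtime specification; it mainly sets up namespaces and cgroups, mounts the root filesystem, and so on.

6. After starting the container, runC exits, and containerd-shim becomes the parent of the container process. It collects the container process's status and reports it to containerd, and after the process with PID 1 in the container exits, it takes over and cleans up the container's remaining child processes so that no zombie processes are left behind (closing file descriptors and so on).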
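The six steps above can be modeled as a chain of delegating calls. This is only a toy: the component names come from the text, but every method is hypothetical and merely records who calls whom. It also captures the key consequence of steps 5 and 6: runC does not stay around, and the long-lived per-container process is containerd-shim, which parents the container process (the dockerd and containerd daemons themselves are long-lived too but are not tracked here).

```python
from typing import List


class CallChain:
    """Toy model of the dockershim-era call path. Methods are hypothetical."""

    def __init__(self) -> None:
        self.trace: List[str] = []
        # Per-container processes that remain after startup finishes.
        self.running_processes: List[str] = []

    def kubelet_create(self, name: str) -> None:
        self.trace.append("kubelet -> dockershim (CRI gRPC)")
        self.dockershim(name)

    def dockershim(self, name: str) -> None:
        self.trace.append("dockershim -> dockerd (Docker API)")
        self.dockerd(name)

    def dockerd(self, name: str) -> None:
        self.trace.append("dockerd -> containerd")
        self.containerd(name)

    def containerd(self, name: str) -> None:
        self.trace.append("containerd -> containerd-shim (spawn)")
        self.running_processes.append(f"containerd-shim[{name}]")
        self.shim(name)

    def shim(self, name: str) -> None:
        self.trace.append("containerd-shim -> runC (OCI runtime)")
        self.runc(name)

    def runc(self, name: str) -> None:
        # runC sets up namespaces/cgroups/rootfs, starts the process, then exits,
        # so runC itself is NOT added to running_processes.
        self.trace.append("runC starts the container process and exits")
        self.running_processes.append(
            f"container[{name}] (parent: containerd-shim[{name}])"
        )


chain = CallChain()
chain.kubelet_create("nginx")
print("\n".join(chain.trace))
print(chain.running_processes)
```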

[Figure: the default k8s container runtime call chain]

III. A brief history of containers and container orchestration

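As the runtime architecture above shows, the kubelet goes through a long call chain to start a container. This is the result of a series of architectural changes driven by competition between the major vendors and Docker in the container and orchestration space, and by Docker's push into the PaaS market. The original k8s runtime path was in fact much simpler: to create a container, the kubelet called the Docker Daemon directly through the Docker API, and the Docker Daemon called the libcontainer library to start the container. To avoid a Docker monopoly and dependence on a Docker-controlled runtime, the major vendors joined together to define an open container standard, the OCI (Open Container Initiative), on which anyone can build their own container runtime. Docker wrapped libcontainer into runC and donated it to the OCI as the reference implementation.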

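Next, Docker launched Swarm to enter the PaaS market and split its architecture accordingly: all container operations moved into a separate daemon, containerd, while the Docker Daemon focused on higher-level packaging and orchestration. Swarm eventually lost to k8s, so Docker donated containerd to the CNCF and turned its focus to Docker Enterprise Edition.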

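Meanwhile, CoreOS released its own container runtime, rkt, and wanted k8s to support it natively. Given the relationship between CoreOS and Google, rkt support was merged into the kubelet's main code base in 2016. This, however, put a heavier burden on SIG-Node, the group that maintains the kubelet: every kubelet update had to maintain two code paths, one for Docker and one for rkt. At the same time, virtualization-based, strongly isolated container technology such as runV (the predecessor of Kata Containers, later merged with Intel Clear Containers) was maturing, and support for virtualized containers quickly landed on the k8s upstream agenda. To stop maintaining a separate code path for every runtime it integrated, SIG-Node decided to abstract container operations into a single interface. The kubelet only talks to this interface, and each container runtime only needs to implement it and expose it to the kubelet as a gRPC service. This unified abstract interface is what k8s calls the CRI; it was first released as an alpha feature in Kubernetes 1.5.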
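The design decision described above (kubelet codes against one interface, each runtime implements it) is ordinary interface-based decoupling. A minimal Python sketch, assuming a heavily trimmed two-method stand-in for the CRI; the class and method names and the ID formats are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List


class ContainerRuntime(ABC):
    """Hypothetical, heavily trimmed stand-in for the CRI."""

    @abstractmethod
    def run_pod_sandbox(self, pod: str) -> str: ...

    @abstractmethod
    def create_container(self, sandbox_id: str, image: str) -> str: ...


class DockerShimLike(ContainerRuntime):
    # Made-up runtime in the spirit of dockershim.
    def run_pod_sandbox(self, pod: str) -> str:
        return f"docker-sandbox-{pod}"

    def create_container(self, sandbox_id: str, image: str) -> str:
        return f"{sandbox_id}/docker-{image}"


class VMRuntimeLike(ContainerRuntime):
    # Made-up hypervisor-based runtime in the spirit of runV/Kata.
    def run_pod_sandbox(self, pod: str) -> str:
        return f"vm-{pod}"

    def create_container(self, sandbox_id: str, image: str) -> str:
        return f"{sandbox_id}/ctr-{image}"


def kubelet_sync_pod(runtime: ContainerRuntime, pod: str, images: List[str]) -> List[str]:
    # The "kubelet" logic is identical whichever runtime is plugged in.
    sandbox = runtime.run_pod_sandbox(pod)
    return [runtime.create_container(sandbox, img) for img in images]


print(kubelet_sync_pod(DockerShimLike(), "web", ["nginx"]))  # ['docker-sandbox-web/docker-nginx']
print(kubelet_sync_pod(VMRuntimeLike(), "web", ["nginx"]))   # ['vm-web/ctr-nginx']
```

Adding a third runtime means writing another subclass; `kubelet_sync_pod` does not change. That is the maintenance burden CRI removed from SIG-Node. The real boundary is a gRPC service rather than an in-process class.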

IV. CRI (Container Runtime Interface)

CRI uses gRPC to define two services, RuntimeService and ImageService, which manage containers and images respectively, as shown below:

// Runtime service defines the public APIs for remote container runtimes
service RuntimeService {
    // Version returns the runtime name, runtime version, and runtime API version.
    rpc Version(VersionRequest) returns (VersionResponse) {}

    // RunPodSandbox creates and starts a pod-level sandbox. Runtimes must ensure
    // the sandbox is in the ready state on success.
    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}
    // StopPodSandbox stops any running process that is part of the sandbox and
    // reclaims network resources (e.g., IP addresses) allocated to the sandbox.
    // If there are any running containers in the sandbox, they must be forcibly
    // terminated.
    // This call is idempotent, and must not return an error if all relevant
    // resources have already been reclaimed. kubelet will call StopPodSandbox
    // at least once before calling RemovePodSandbox. It will also attempt to
    // reclaim resources eagerly, as soon as a sandbox is not needed. Hence,
    // multiple StopPodSandbox calls are expected.
    rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}
    // RemovePodSandbox removes the sandbox. If there are any running containers
    // in the sandbox, they must be forcibly terminated and removed.
    // This call is idempotent, and must not return an error if the sandbox has
    // already been removed.
    rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse) {}
    // PodSandboxStatus returns the status of the PodSandbox. If the PodSandbox is not
    // present, returns an error.
    rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {}
    // ListPodSandbox returns a list of PodSandboxes.
    rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {}

    // CreateContainer creates a new container in specified PodSandbox
    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
    // StartContainer starts the container.
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}
    // StopContainer stops a running container with a grace period (i.e., timeout).
    // This call is idempotent, and must not return an error if the container has
    // already been stopped.
    // TODO: what must the runtime do after the grace period is reached?
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}
    // RemoveContainer removes the container. If the container is running, the
    // container must be forcibly removed.
    // This call is idempotent, and must not return an error if the container has
    // already been removed.
    rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {}
    // ListContainers lists all containers by filters.
    rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {}
    // ContainerStatus returns status of the container. If the container is not
    // present, returns an error.
    rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {}
    // UpdateContainerResources updates ContainerConfig of the container.
    rpc UpdateContainerResources(UpdateContainerResourcesRequest) returns (UpdateContainerResourcesResponse) {}
    // ReopenContainerLog asks runtime to reopen the stdout/stderr log file
    // for the container. This is often called after the log file has been
    // rotated. If the container is not running, container runtime can choose
    // to either create a new log file and return nil, or return an error.
    // Once it returns error, new container log file MUST NOT be created.
    rpc ReopenContainerLog(ReopenContainerLogRequest) returns (ReopenContainerLogResponse) {}

    // ExecSync runs a command in a container synchronously.
    rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse) {}
    // Exec prepares a streaming endpoint to execute a command in the container.
    rpc Exec(ExecRequest) returns (ExecResponse) {}
    // Attach prepares a streaming endpoint to attach to a running container.
    rpc Attach(AttachRequest) returns (AttachResponse) {}
    // PortForward prepares a streaming endpoint to forward ports from a PodSandbox.
    rpc PortForward(PortForwardRequest) returns (PortForwardResponse) {}

    // ContainerStats returns stats of the container. If the container does not
    // exist, the call returns an error.
    rpc ContainerStats(ContainerStatsRequest) returns (ContainerStatsResponse) {}
    // ListContainerStats returns stats of all running containers.
    rpc ListContainerStats(ListContainerStatsRequest) returns (ListContainerStatsResponse) {}

    // UpdateRuntimeConfig updates the runtime configuration based on the given request.
    rpc UpdateRuntimeConfig(UpdateRuntimeConfigRequest) returns (UpdateRuntimeConfigResponse) {}

    // Status returns the status of the runtime.
    rpc Status(StatusRequest) returns (StatusResponse) {}
}

// ImageService defines the public APIs for managing images.
service ImageService {
    // ListImages lists existing images.
    rpc ListImages(ListImagesRequest) returns (ListImagesResponse) {}
    // ImageStatus returns the status of the image. If the image is not
    // present, returns a response with ImageStatusResponse.Image set to
    // nil.
    rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse) {}
    // PullImage pulls an image with authentication config.
    rpc PullImage(PullImageRequest) returns (PullImageResponse) {}
    // RemoveImage removes the image.
    // This call is idempotent, and must not return an error if the image has
    // already been removed.
    rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse) {}
    // ImageFSInfo returns information of the filesystem that is used to store images.
    rpc ImageFsInfo(ImageFsInfoRequest) returns (ImageFsInfoResponse) {}
}
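The comments in the service definitions carry real contract rules: a sandbox must be ready after RunPodSandbox, StopPodSandbox and RemovePodSandbox must be idempotent, and stopping a sandbox forcibly terminates its containers. The in-memory toy below honors those few rules so the call order can be seen end to end (roughly RunPodSandbox, PullImage, CreateContainer, StartContainer when a pod starts). It is an illustration under stated assumptions, not a CRI server: there is no gRPC, no real process, and the snake_case methods, ID format and state strings are hypothetical.

```python
from typing import Dict, Set


class FakeRuntime:
    """In-memory toy that honors a few rules from the CRI comments above."""

    def __init__(self) -> None:
        self.images: Set[str] = set()
        self.sandboxes: Dict[str, str] = {}  # sandbox id -> "READY" / "NOTREADY"
        self.containers: Dict[str, Dict[str, str]] = {}  # id -> {"sandbox", "state"}
        self._next = 0

    def _new_id(self, prefix: str) -> str:
        self._next += 1
        return f"{prefix}-{self._next}"

    def pull_image(self, image: str) -> str:
        self.images.add(image)
        return image

    def run_pod_sandbox(self) -> str:
        sid = self._new_id("sandbox")
        self.sandboxes[sid] = "READY"  # sandbox must be ready on success
        return sid

    def create_container(self, sandbox_id: str, image: str) -> str:
        if sandbox_id not in self.sandboxes:
            raise KeyError(f"unknown sandbox {sandbox_id}")
        if image not in self.images:
            raise KeyError(f"image {image} not pulled")
        cid = self._new_id("ctr")
        self.containers[cid] = {"sandbox": sandbox_id, "state": "CREATED"}
        return cid

    def start_container(self, cid: str) -> None:
        self.containers[cid]["state"] = "RUNNING"

    def stop_pod_sandbox(self, sandbox_id: str) -> None:
        # Idempotent: repeated calls, or calls after removal, are not errors.
        if sandbox_id not in self.sandboxes:
            return
        for c in self.containers.values():
            if c["sandbox"] == sandbox_id:
                c["state"] = "EXITED"  # running containers are forcibly terminated
        self.sandboxes[sandbox_id] = "NOTREADY"

    def remove_pod_sandbox(self, sandbox_id: str) -> None:
        # Idempotent too; also removes the sandbox's containers.
        self.containers = {
            k: v for k, v in self.containers.items() if v["sandbox"] != sandbox_id
        }
        self.sandboxes.pop(sandbox_id, None)


rt = FakeRuntime()
sid = rt.run_pod_sandbox()                      # RunPodSandbox
rt.pull_image("nginx:1.17")                     # PullImage
cid = rt.create_container(sid, "nginx:1.17")    # CreateContainer
rt.start_container(cid)                         # StartContainer
print(rt.containers[cid]["state"])              # RUNNING
rt.stop_pod_sandbox(sid)                        # StopPodSandbox
rt.stop_pod_sandbox(sid)                        # repeated call is fine
print(rt.sandboxes[sid], rt.containers[cid]["state"])  # NOTREADY EXITED
rt.remove_pod_sandbox(sid)                      # RemovePodSandbox
rt.remove_pod_sandbox(sid)                      # also idempotent
print(rt.sandboxes, rt.containers)              # {} {}
```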

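A concrete container runtime must implement the interfaces defined by CRI, that is, provide a gRPC server, usually called a CRI shim. When it starts its gRPC server, the runtime listens on a local Unix socket (on Windows, a TCP address is used instead).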
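The kubelet is pointed at a runtime's socket with its `--container-runtime-endpoint` flag, using URLs such as `unix:///run/containerd/containerd.sock` (containerd) or `unix:///var/run/crio/crio.sock` (CRI-O). The helper below is a hypothetical sketch, not kubelet code, showing how such an endpoint splits into a transport and an address. The TCP address in the example is made up.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_cri_endpoint(endpoint: str) -> Tuple[str, str]:
    """Split a CRI endpoint URL into (scheme, address).

    Hypothetical helper for illustration: 'unix' yields a socket path,
    'tcp' yields host:port; anything else is rejected.
    """
    u = urlparse(endpoint)
    if u.scheme == "unix":
        return "unix", u.path
    if u.scheme == "tcp":
        return "tcp", u.netloc
    raise ValueError(f"unsupported CRI endpoint scheme: {u.scheme!r}")


print(parse_cri_endpoint("unix:///run/containerd/containerd.sock"))
# ('unix', '/run/containerd/containerd.sock')
print(parse_cri_endpoint("tcp://127.0.0.1:12345"))  # made-up address
# ('tcp', '127.0.0.1:12345')
```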

V. Container runtime implementations

Besides the default runtime implementation (dockershim) described above, the main container runtimes currently include:

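cri-o: a container runtime compatible with both OCI and CRI
cri-containerd: a Kubernetes CRI implementation based on containerd
rkt: the container runtime promoted by CoreOS as a competitor to Docker
frakti: a hypervisor-based CRI implementation
Clear Containers: a container runtime from Intel, compatible with both OCI and CRI
Kata Containers: conforms to the OCI specification and is CRI-compatible
gVisor: a sandboxed container runtime from Google (experimental)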


51CTO Blog