Kubernetes version: 1.24

1. Introduction to cgroup parameters

This section introduces the cgroup parameters Kubernetes relies on, focusing on cgroup v1.
cpu.shares
Sets the relative amount of CPU available to the processes in a cgroup. When the system is idle, the processes may use as much CPU as they like and are not constrained by this value; when the system is busy, it guarantees the minimum share of CPU the processes will get.
The value is relative, and the default is 1024 regardless of the number of cores. The share a cgroup eventually gets is its own cpu.shares divided by the sum of cpu.shares across all cgroups. For example, on a single-core system, if cgroups A and B both keep the default 1024, processes in A and in B can each use 50% of the CPU; if a cgroup C with a value of 2048 is added, A gets 25%, B gets 25% and C gets 50%. The same reasoning applies on multi-core systems.
The key point is that cpu.shares guarantees a minimum amount of CPU: if cgroup A is entitled to 50% of the CPU, its processes are guaranteed 50% no matter how busy the system is, and when the system is idle they may use up to 100%.

In Kubernetes, resources.requests.cpu declares the minimum CPU a container needs; the kubelet converts it into a cpu.shares value via the function MilliCPUToShares, roughly:
cpu.shares = (resources.requests.cpu * 1024) / 1000
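
A rough sketch of that conversion in Go, mirroring the formula above (names and constants here are illustrative, not the exact kubelet source, which also clamps the result to the kernel minimum of 2 shares):

const (
	minShares     = 2    //kernel-enforced minimum for cpu.shares
	sharesPerCPU  = 1024 //shares corresponding to one full CPU
	milliCPUToCPU = 1000 //milliCPU per CPU
)

//milliCPUToShares sketches the milliCPU -> cpu.shares conversion.
func milliCPUToShares(milliCPU int64) uint64 {
	if milliCPU == 0 {
		//no request: fall back to the minimum number of shares
		return minShares
	}
	shares := (milliCPU * sharesPerCPU) / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	return uint64(shares)
}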

cpu.cfs_period_us
Sets the length of the period over which the cgroup's CPU bandwidth is redistributed, i.e. how often the CPU allowance is refilled; think of it as a time slice. The unit is microseconds and the valid range is 1ms to 1s (1000-1000000).

cpu.cfs_quota_us
Sets how much CPU time the cgroup may consume within one period, i.e. the CPU allowance for one time slice. A value of -1 means the cgroup is not limited; the minimum value is 1ms.

In Kubernetes, resources.limits.cpu declares the maximum CPU a container may use; the kubelet converts it into a cpu.cfs_quota_us value via the function MilliCPUToQuota. cpu.cfs_period_us can be set via a kubelet flag and defaults to 100ms. The conversion is:
cpu.cfs_quota_us = (resources.limits.cpu * cpu.cfs_period_us) / 1000
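
A rough sketch of the quota conversion (names illustrative; the real MilliCPUToQuota also enforces the kernel's 1ms minimum quota):

const minQuotaPeriod = 1000 //1ms, the smallest quota the kernel accepts, in microseconds

//milliCPUToQuota sketches the milliCPU -> cpu.cfs_quota_us conversion.
func milliCPUToQuota(milliCPU int64, period int64) int64 {
	if milliCPU == 0 {
		//no limit requested: the kubelet leaves cpu.cfs_quota_us at -1
		return 0
	}
	quota := (milliCPU * period) / 1000 //1000 milliCPU per CPU
	if quota < minQuotaPeriod {
		quota = minQuotaPeriod
	}
	return quota
}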

memory.limit_in_bytes
Sets the maximum amount of memory the processes in the cgroup may use. Without a suffix the value is in bytes; suffixes such as K/M/G denote larger units. A value of -1 means the cgroup is not limited.

In Kubernetes, resources.requests.memory declares the minimum memory a container needs, but cgroup v1 has no way to guarantee a memory minimum, so this value is not applied there. cgroup v2 does support it, and it can be configured after enabling the MemoryQoS feature.
resources.limits.memory declares the maximum memory and is converted into memory.limit_in_bytes.
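
As a small illustrative snippet (not kubelet code), the "128Mi" used by the demo pods later in this article is parsed with the apimachinery resource.Quantity type and ends up as plain bytes in memory.limit_in_bytes:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	limit := resource.MustParse("128Mi")
	//128Mi = 128 * 1024 * 1024 = 134217728 bytes, the value later seen in memory.limit_in_bytes
	fmt.Println(limit.Value()) // 134217728
}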

2. Pod QoS classes

Based on the values given in resources.requests and resources.limits, a pod falls into one of three QoS classes:
a. Guaranteed: every container in the pod specifies both requests and limits, they are non-zero, and the requests equal the limits
b. Burstable: at least one container in the pod specifies a request or a limit, but the pod does not meet the Guaranteed criteria
c. BestEffort: no container in the pod specifies any request or limit

How the QoS classes are implemented underneath:
a. Processes in different QoS classes get different oom_score_adj values, which feed into the final oom_score; the higher the oom_score, the earlier the process is killed when an OOM occurs.
Guaranteed pods get oom_score_adj -997, Burstable pods get a value between 3 and 999 derived from their memory request, and BestEffort pods get 1000.
See pkg/kubelet/qos/policy.go:GetContainerOOMScoreAdjust; a simplified sketch follows below this list.
b. QoS is implemented with cgroups: each class lives at a different level of the cgroup hierarchy. Guaranteed pods sit directly under ROOT/kubepods,
Burstable pods under ROOT/kubepods/kubepods-burstable, and BestEffort pods under ROOT/kubepods/kubepods-besteffort.
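
A simplified, self-contained sketch of that oom_score_adj calculation (the real GetContainerOOMScoreAdjust also special-cases node-critical pods; plain ints are used here instead of the Kubernetes API types):

const (
	guaranteedOOMScoreAdj = -997
	besteffortOOMScoreAdj = 1000
)

//containerOOMScoreAdjust sketches the per-container oom_score_adj logic.
//qosClass is "Guaranteed", "Burstable" or "BestEffort"; memoryRequest and
//memoryCapacity are in bytes.
func containerOOMScoreAdjust(qosClass string, memoryRequest, memoryCapacity int64) int64 {
	switch qosClass {
	case "Guaranteed":
		return guaranteedOOMScoreAdj
	case "BestEffort":
		return besteffortOOMScoreAdj
	}
	//Burstable: the larger the memory request relative to node capacity,
	//the lower (safer) the score, bounded to the range [3, 999].
	adj := 1000 - (1000*memoryRequest)/memoryCapacity
	if adj < 1000+guaranteedOOMScoreAdj { //i.e. below 3
		adj = 1000 + guaranteedOOMScoreAdj
	}
	if adj == besteffortOOMScoreAdj { //keep Burstable strictly below BestEffort
		adj = besteffortOOMScoreAdj - 1
	}
	return adj
}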

Notes:
The QoS class cannot be set in the pod YAML; it is computed automatically. See pkg/apis/core/helper/qos/qos.go:GetPodQOS; a simplified sketch follows below.
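
A simplified sketch of that classification, restricted to CPU and memory and to regular containers (the real GetPodQOS also looks at init containers and all supported compute resources):

//res holds one container's requests and limits (milliCPU for CPU, bytes for memory);
//a zero value means "not specified".
type res struct {
	cpuReq, cpuLim, memReq, memLim int64
}

func podQOS(containers []res) string {
	hasAny := false
	guaranteed := true
	for _, c := range containers {
		if c.cpuReq != 0 || c.memReq != 0 || c.cpuLim != 0 || c.memLim != 0 {
			hasAny = true
		}
		//Guaranteed needs limits for both resources, and any explicit request
		//must equal its limit (an unset request defaults to the limit).
		if c.cpuLim == 0 || c.memLim == 0 ||
			(c.cpuReq != 0 && c.cpuReq != c.cpuLim) ||
			(c.memReq != 0 && c.memReq != c.memLim) {
			guaranteed = false
		}
	}
	switch {
	case !hasAny:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}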

If only a limit is specified and no request, the request defaults to the value of the limit.
If both are specified, the request must not be greater than the limit.

When scheduling, kube-scheduler only looks at the requests; the limits are not considered.

3. cgroup drivers

Two cgroup drivers are supported: cgroupfs and systemd. The former manipulates the cgroup files directly; the latter goes through systemd's API and lets systemd manage the cgroups.
With the systemd driver, cgroup directory names carry a .slice suffix; see pkg/kubelet/cm/cgroup_manager_linux.go:ToSystemd and the sketch below.
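
A rough sketch of how a cgroup name such as ["kubepods", "burstable", "pod<UID>"] ends up as the nested .slice path seen in the trees later on (the function below is illustrative; the real mapping is done by ToSystemd together with systemd's slice expansion):

//toSystemdPath joins each component onto its parent with "-" and appends ".slice",
//which is why the on-disk path looks like
//  /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice
func toSystemdPath(components []string) string {
	path, name := "", ""
	for _, c := range components {
		if name == "" {
			name = c
		} else {
			name = name + "-" + c
		}
		path = path + "/" + name + ".slice"
	}
	if path == "" {
		return "/"
	}
	return path
}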

4. kubelet flags related to cgroups

a. --cgroups-per-qos: when enabled, the kubelet creates a cgroup hierarchy for the QoS classes and for each pod; defaults to true
b. --cgroup-root: the root cgroup for pods, default /, i.e. /sys/fs/cgroup/; when --cgroups-per-qos is enabled, kubepods is appended, so the effective root becomes /kubepods
c. --enforce-node-allocatable: which reservations the kubelet enforces; valid values are none, pods, system-reserved and kube-reserved, default pods.
If system-reserved is included, --system-reserved-cgroup must also be set;
if kube-reserved is included, --kube-reserved-cgroup must also be set
d. --system-reserved: resources reserved for system processes, e.g. cpu=200m,memory=500Mi,ephemeral-storage=1Gi
e. --kube-reserved: resources reserved for Kubernetes components, e.g. cpu=200m,memory=500Mi,ephemeral-storage=1Gi
f. --system-reserved-cgroup: absolute path of the cgroup used by system processes; the values from --system-reserved are written into this cgroup to cap what system processes may use.
For example, if it is set to /sys and the systemd driver is used, the user must create /sys/fs/cgroup/sys.slice in advance
g. --kube-reserved-cgroup: absolute path of the cgroup used by Kubernetes components; the values from --kube-reserved are written into this cgroup to cap what those components may use.
For example, if it is set to /kube and the systemd driver is used, the user must create /sys/fs/cgroup/kube.slice in advance
h. --system-cgroups: absolute path of the cgroup for system processes, ideally placed under the --system-reserved-cgroup hierarchy, e.g. /sys.slice/system; the kubelet creates the path automatically.
See pkg/kubelet/cm/container_manager_linux.go:ensureSystemCgroups, which tries to move every process that is neither a kernel thread nor PID 1 into this cgroup.
On a systemd host, however, the processes it finds are either kernel threads or descendants of PID 1 already placed under systemd-managed cgroups, so in practice nothing is moved even if this flag is set
i. --kubelet-cgroups: absolute path of the cgroup for the kubelet process, ideally placed under the --kube-reserved-cgroup hierarchy, e.g. /kube.slice/kubelet;
the kubelet creates the path automatically. See pkg/kubelet/cm/container_manager_linux.go:ensureProcessInContainerWithOOMScore, which also sets the kubelet's
oom_score_adj to -999
j. --qos-reserved: percentage of pod resource requests to reserve at the QoS level for higher-priority pods; currently only memory is supported. For example, with memory=100% and 1G of allocatable memory, creating a Guaranteed pod with a 100M limit
reserves 100M for it: memory.limit_in_bytes of the Burstable and BestEffort cgroups is set to 900M. If a Burstable pod with a 200M limit is then created, memory.limit_in_bytes of the BestEffort cgroup drops to 700M (this requires the QOSReserved feature gate)

5. The Kubernetes cgroup hierarchy

After kubelet starts, it creates a kubepods directory under the directory given by --cgroup-root, e.g. /sys/fs/cgroup/cpu/kubepods, and writes the node's allocatable resources into the corresponding cgroup files there (cpu.shares and so on). Every pod created afterwards lives under this kubepods directory, which is how the total resources used by pods are capped. Under kubepods, pods are grouped by QoS class: a Guaranteed pod's cgroup sits directly under kubepods; for Burstable pods a kubepods-burstable.slice directory is created and their cgroups go under it; for BestEffort pods a kubepods-besteffort.slice directory is created and their cgroups go under it.

Let's create one pod of each QoS class and look at the resulting cgroup hierarchy.

a. A pod whose requests equal its limits, i.e. a Guaranteed pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-demo1
  template:
    metadata:
      labels:
        app: nginx-demo1
    spec:
      nodeName: master
      containers:
      - image: nginx:1.14
        imagePullPolicy: IfNotPresent
        name: nginx
        resources:
          requests:
            memory: "128Mi"
            cpu: "500m"
          limits:
            memory: "128Mi"
            cpu: "500m"

b. A pod whose requests are less than its limits, i.e. a Burstable pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-demo2
  template:
    metadata:
      labels:
        app: nginx-demo2
    spec:
      nodeName: master
      containers:
      - image: nginx:1.14
        imagePullPolicy: IfNotPresent
        name: nginx
        resources:
          requests:
            memory: "128Mi"
            cpu: "500m"
          limits:
            memory: "256Mi"
            cpu: "1000m"

c. A pod with no requests or limits, i.e. a BestEffort pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo3
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-demo3
  template:
    metadata:
      labels:
        app: nginx-demo3
    spec:
      nodeName: master
      containers:
      - image: nginx:1.14
        imagePullPolicy: IfNotPresent
        name: nginx

Below is the memory cgroup hierarchy. kubepods.slice contains every pod on this node; its memory.limit_in_bytes of 2809M caps the memory that all pods together may use. The subdirectories under kubepods.slice correspond to the QoS classes; there is currently only one Guaranteed pod, so there is a single kubepods-pod<UID>.slice directory there, and more Guaranteed pods would mean more such directories. Note that a pod's containers are not placed under the pod's own directory but under system.slice/containerd.service; each pod contributes two container cgroups there, its application container and its sandbox (pause) container.

root@master:/root# tree /sys/fs/cgroup/memory
/sys/fs/cgroup/memory/
├── memory.limit_in_bytes  //9223372036854771712
├── kubepods.slice
│   ├── memory.limit_in_bytes  //2946347008 bytes / 2877292Ki / 2809M
│   ├── kubepods-besteffort.slice
│   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   ├── kubepods-besteffort-podde4983ac-ff0c-40be-8472-8b6674593aa3.slice //the BestEffort pod
│   │   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   │   └── tasks
│   │   └── tasks
│   ├── kubepods-burstable.slice
│   │   ├── memory.limit_in_bytes //9223372036854771712, the max value, i.e. no memory limit at the QoS level
│   │   ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice //the Burstable pod
│   │   │   ├── memory.limit_in_bytes //268435456/256M
│   │   │   └── tasks
│   │   └── tasks
│   ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice //the Guaranteed pod
│   │   ├── memory.limit_in_bytes //134217728/128M
│   │   └── tasks
│   └── tasks
├── kube.slice
│   ├── memory.limit_in_bytes //104857600/100M, 100M reserved for Kubernetes components
│   ├── kubelet
│   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   └── tasks
│   └── tasks
├── sys.slice
│   ├── memory.limit_in_bytes //104857600/100M, 100M reserved for system processes
│   └── tasks
├── system.slice
│   ├── memory.limit_in_bytes //9223372036854771712
│   ├── containerd.service
│   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   ├── kubepods-besteffort-podde4983ac_ff0c_40be_8472_8b6674593aa3.slice:cri-containerd:5a323896aa0db2f15c9f82145cd38851783d08d8bf132f3ed4a7613a3830f71a
│   │   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   │   └── tasks
│   │   ├── kubepods-besteffort-podde4983ac_ff0c_40be_8472_8b6674593aa3.slice:cri-containerd:e6803695024464a3365721812dcff0347c40e162b8142244a527da7b785f215c
│   │   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   │   └── tasks
│   │   ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice:cri-containerd:3c6a7115e688913d0a6d382607f0c1a9b5ecf58d4ee33c9c24e640dc33b80acc
│   │   │   ├── memory.limit_in_bytes //268435456/256M
│   │   │   └── tasks
│   │   ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice:cri-containerd:67e2b0336ed2af44875ad7b1fb9c35bae335673cf20a2a1d8331b85d4bea4d95
│   │   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   │   └── tasks
│   │   ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice:cri-containerd:836a0a6aa460663b9a4dc8961dd55da11ae090c9e76705f81e9c7d43060423c3
│   │   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   │   └── tasks
│   │   ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice:cri-containerd:9bbc1d7134d322e988ace0cbb4fc75f44184f4e0f24f1c0228be7eed6ec6f659
│   │   │   ├── memory.limit_in_bytes //134217728/128M
│   │   │   └── tasks
│   └── tasks
├── tasks

Below is the CPU cgroup hierarchy. The directory structure is the same as for memory, and cpu.cfs_period_us is 100000 at every level.

root@master:/root# tree /sys/fs/cgroup/cpu
/sys/fs/cgroup/cpu/
├── cpu.cfs_period_us //100000
├── cpu.cfs_quota_us //-1
├── cpu.shares //1024
├── kubepods.slice
│   ├── cpu.cfs_period_us
│   ├── cpu.cfs_quota_us //-1
│   ├── cpu.shares //7168
│   ├── kubepods-besteffort.slice
│   │   ├── cpu.cfs_period_us
│   │   ├── cpu.cfs_quota_us //-1
│   │   ├── cpu.shares //2
│   │   ├── kubepods-besteffort-podde4983ac-ff0c-40be-8472-8b6674593aa3.slice
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us
│   │   │   ├── cpu.shares //2
│   │   │   └── tasks
│   │   └── tasks
│   ├── kubepods-burstable.slice
│   │   ├── cpu.cfs_period_us
│   │   ├── cpu.cfs_quota_us //-1
│   │   ├── cpu.shares //1546
│   │   ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us //100000
│   │   │   ├── cpu.shares //512
│   │   │   └── tasks
│   │   └── tasks
│   ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice
│   │   ├── cpu.cfs_period_us
│   │   ├── cpu.cfs_quota_us //50000
│   │   ├── cpu.shares //512
│   │   └── tasks
│   └── tasks
├── kube.slice
│   ├── cpu.cfs_period_us
│   ├── cpu.cfs_quota_us //-1
│   ├── cpu.shares //512
│   ├── kubelet
│   │   ├── cpu.cfs_period_us
│   │   ├── cpu.cfs_quota_us //-1
│   │   ├── cpu.shares //1024
│   │   └── tasks
│   └── tasks
├── sys.slice
│   ├── cpu.cfs_period_us
│   ├── cpu.cfs_quota_us //-1
│   ├── cpu.shares //512
│   └── tasks
├── system.slice
│   ├── cpu.cfs_period_us
│   ├── cpu.cfs_quota_us //-1
│   ├── cpu.shares //1024
│   ├── containerd.service
│   │   ├── cpu.cfs_period_us
│   │   ├── cpu.cfs_quota_us //-1
│   │   ├── cpu.shares //1024
│   │   ├── kubepods-besteffort-podde4983ac_ff0c_40be_8472_8b6674593aa3.slice:cri-containerd:5a323896aa0db2f15c9f82145cd38851783d08d8bf132f3ed4a7613a3830f71a
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us //-1
│   │   │   ├── cpu.shares //2
│   │   │   └── tasks
│   │   ├── kubepods-besteffort-podde4983ac_ff0c_40be_8472_8b6674593aa3.slice:cri-containerd:e6803695024464a3365721812dcff0347c40e162b8142244a527da7b785f215c
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us //-1
│   │   │   ├── cpu.shares //2
│   │   │   └── tasks
│   │   ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice:cri-containerd:3c6a7115e688913d0a6d382607f0c1a9b5ecf58d4ee33c9c24e640dc33b80acc
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us //100000
│   │   │   ├── cpu.shares //512
│   │   │   └── tasks
│   │   ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice:cri-containerd:67e2b0336ed2af44875ad7b1fb9c35bae335673cf20a2a1d8331b85d4bea4d95
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us //-1
│   │   │   ├── cpu.shares //2
│   │   │   └── tasks
│   │   ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice:cri-containerd:836a0a6aa460663b9a4dc8961dd55da11ae090c9e76705f81e9c7d43060423c3
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us //-1
│   │   │   ├── cpu.shares //2
│   │   │   └── tasks
│   │   ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice:cri-containerd:9bbc1d7134d322e988ace0cbb4fc75f44184f4e0f24f1c0228be7eed6ec6f659
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us //50000
│   │   │   ├── cpu.shares //512
│   │   │   └── tasks
│   └── tasks
├── tasks

6. How the resource values are computed

Based on the hierarchy above, the cgroups Kubernetes manages can be split into four levels: node level, qos level, pod level and container level. Let's look at how the resource values at each level are computed.
a. node level
The node level exists to cap the total resources pods can use, so that unbounded pod resource usage cannot crowd out the other processes on the node and destabilize it. This relies on the node allocatable mechanism, which a later article will cover in detail; here we only look at how the node-level values are computed.
The node's total resources are its capacity; kube-reserved and system-reserved are the reservations given by --kube-reserved and --system-reserved. If --enforce-node-allocatable includes pods, the node level gets capacity - kube-reserved - system-reserved; if pods is not included, the node level simply gets the full capacity.

kubepods.slice/cpu.shares = capacity(cpu) - kube-reserved(cpu) - system-reserved(cpu)  //CPU is only converted into cpu.shares; no quota is set at this level
kubepods.slice/memory.limit_in_bytes = capacity(memory) - kube-reserved(memory) - system-reserved(memory)
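
For the demo node in the trees above (assuming 8 CPUs and cpu=500m in both --kube-reserved and --system-reserved, which matches the cpu.shares of 512 on kube.slice and sys.slice): kubepods.slice/cpu.shares = (8000m - 500m - 500m) * 1024 / 1000 = 7168, exactly the value shown in the CPU hierarchy.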

b. qos level
The three QoS classes are handled differently. Guaranteed pods have requests equal to limits and sit directly under the node level, so nothing needs to be computed for them. The memory values depend on --qos-reserved: if it is not set, no memory limit is applied at the QoS level. The formulas below assume --qos-reserved is set.

Burstable:
kubepods.slice/kubepods-burstable.slice/cpu.shares = sum of requests[cpu] of all Burstable pods (converted to shares)
kubepods.slice/kubepods-burstable.slice/memory.limit_in_bytes = kubepods.slice/memory.limit_in_bytes - (sum of requests[memory] of all Guaranteed pods) * (reservePercent / 100)

BestEffort:
kubepods.slice/kubepods-besteffort.slice/cpu.shares = 2
kubepods.slice/kubepods-besteffort.slice/memory.limit_in_bytes = kubepods.slice/memory.limit_in_bytes - (sum of requests[memory] of all Guaranteed and Burstable pods) * (reservePercent / 100)
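
In the trees above --qos-reserved was not set, so the QoS-level memory.limit_in_bytes stays at its maximum value. The BestEffort cpu.shares is the fixed minimum of 2, and the Burstable cpu.shares of 1546 is the converted sum of the CPU requests of all Burstable pods on the node (roughly 1510m, presumably the demo pod's 500m plus other Burstable pods such as kube-system components).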

c. pod level
The values per QoS class are:

Guaranteed:
kubepods.slice/kubepods-pod<UID>.slice/cpu.shares = sum of requests[cpu] of all containers (converted to shares)
kubepods.slice/kubepods-pod<UID>.slice/cpu.cfs_period_us = 100000
kubepods.slice/kubepods-pod<UID>.slice/cpu.cfs_quota_us = sum of limits[cpu] of all containers (converted to quota)
kubepods.slice/kubepods-pod<UID>.slice/memory.limit_in_bytes = sum of limits[memory] of all containers
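
For the Guaranteed demo pod above (cpu 500m, memory 128Mi): cpu.shares = 500 * 1024 / 1000 = 512, cpu.cfs_quota_us = 500 * 100000 / 1000 = 50000, and memory.limit_in_bytes = 134217728, matching the kubepods-pod5799... entries in the trees.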

Burstable:
kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/cpu.shares = sum of requests[cpu] of all containers (converted to shares)
kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/cpu.cfs_period_us = 100000
kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/cpu.cfs_quota_us = sum of limits[cpu] of all containers (only set when every container declares a CPU limit, otherwise -1)
kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/memory.limit_in_bytes = sum of limits[memory] of all containers (only set when every container declares a memory limit, otherwise unlimited)
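
For the Burstable demo pod above (requests cpu 500m, limits cpu 1000m / memory 256Mi): cpu.shares = 500 * 1024 / 1000 = 512, cpu.cfs_quota_us = 1000 * 100000 / 1000 = 100000, and memory.limit_in_bytes = 268435456, matching the kubepods-burstable-pod18ec1047... entries in the trees.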

BestEffort:
kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<UID>.slice/cpu.shares = 2
kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<UID>.slice/cpu.cfs_period_us = 100000
kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<UID>.slice/cpu.cfs_quota_us = -1 (no limit)
kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<UID>.slice/memory.limit_in_bytes = 9223372036854771712 (no limit)

d. container level
The values per QoS class are:

Guaranteed:
system.slice/containerd.service/kubepods-pod<UID>.slice:cri-containerd:<container-id>/cpu.shares = requests[cpu] of the container (converted to shares)
system.slice/containerd.service/kubepods-pod<UID>.slice:cri-containerd:<container-id>/cpu.cfs_period_us = 100000
system.slice/containerd.service/kubepods-pod<UID>.slice:cri-containerd:<container-id>/cpu.cfs_quota_us = limits[cpu] of the container (converted to quota)
system.slice/containerd.service/kubepods-pod<UID>.slice:cri-containerd:<container-id>/memory.limit_in_bytes = limits[memory] of the container

Burstable:
system.slice/containerd.service/kubepods-burstable-pod<UID>.slice:cri-containerd:<container-id>/cpu.shares = requests[cpu] of the container (converted to shares)
system.slice/containerd.service/kubepods-burstable-pod<UID>.slice:cri-containerd:<container-id>/cpu.cfs_period_us = 100000
system.slice/containerd.service/kubepods-burstable-pod<UID>.slice:cri-containerd:<container-id>/cpu.cfs_quota_us = limits[cpu] of the container (converted to quota; -1 if no CPU limit is declared)
system.slice/containerd.service/kubepods-burstable-pod<UID>.slice:cri-containerd:<container-id>/memory.limit_in_bytes = limits[memory] of the container (unlimited if no memory limit is declared)

BestEffort:
system.slice/containerd.service/kubepods-besteffort-pod<UID>.slice:cri-containerd:<container-id>/cpu.shares = 2
system.slice/containerd.service/kubepods-besteffort-pod<UID>.slice:cri-containerd:<container-id>/cpu.cfs_period_us = 100000
system.slice/containerd.service/kubepods-besteffort-pod<UID>.slice:cri-containerd:<container-id>/cpu.cfs_quota_us = -1
system.slice/containerd.service/kubepods-besteffort-pod<UID>.slice:cri-containerd:<container-id>/memory.limit_in_bytes = 9223372036854771712

7. Source code analysis

Here we look at when the cgroups at each level are created and when they are updated.
a. node level
Call path: containerManagerImpl.Start -> setupNode -> createNodeAllocatableCgroups

//Code path: pkg/kubelet/cm/node_container_manager.go
//createNodeAllocatableCgroups creates Node Allocatable Cgroup when CgroupsPerQOS flag is specified as true
func (cm *containerManagerImpl) createNodeAllocatableCgroups() error {
	//start from the node's capacity
	nodeAllocatable := cm.internalCapacity
	// Use Node Allocatable limits instead of capacity if the user requested enforcing node allocatable.
	nc := cm.NodeConfig.NodeAllocatableConfig
	//if --cgroups-per-qos is true and --enforce-node-allocatable includes pods, subtract the resources reserved by --system-reserved and --kube-reserved
	if cm.CgroupsPerQOS && nc.EnforceNodeAllocatable.Has(kubetypes.NodeAllocatableEnforcementKey) {
		nodeAllocatable = cm.getNodeAllocatableInternalAbsolute()
	}

	cgroupConfig := &CgroupConfig{
		Name: cm.cgroupRoot,
		// The default limits for cpu shares can be very low which can lead to CPU starvation for pods.
		ResourceParameters: getCgroupConfig(nodeAllocatable),
	}
	
	//ask the cgroupManager whether the cgroup already exists
	if cm.cgroupManager.Exists(cgroupConfig.Name) {
		return nil
	}
	//if not, create the node-level cgroup directory, e.g. /sys/fs/cgroup/kubepods.slice
	if err := cm.cgroupManager.Create(cgroupConfig); err != nil {
		klog.ErrorS(err, "Failed to create cgroup", "cgroupName", cm.cgroupRoot)
		return err
	}
	return nil
}

// getNodeAllocatableInternalAbsolute is similar to getNodeAllocatableAbsolute except that
// it also includes internal resources (currently process IDs).  It is intended for setting
// up top level cgroups only.
func (cm *containerManagerImpl) getNodeAllocatableInternalAbsolute() v1.ResourceList {
	return cm.getNodeAllocatableAbsoluteImpl(cm.internalCapacity)
}

func (cm *containerManagerImpl) getNodeAllocatableAbsoluteImpl(capacity v1.ResourceList) v1.ResourceList {
	result := make(v1.ResourceList)
	for k, v := range capacity {
		value := v.DeepCopy()
		if cm.NodeConfig.SystemReserved != nil {
			value.Sub(cm.NodeConfig.SystemReserved[k])
		}
		if cm.NodeConfig.KubeReserved != nil {
			value.Sub(cm.NodeConfig.KubeReserved[k])
		}
		if value.Sign() < 0 {
			// Negative Allocatable resources don't make sense.
			value.Set(0)
		}
		result[k] = value
	}
	return result
}

// getCgroupConfig returns a ResourceConfig object that can be used to create or update cgroups via CgroupManager interface.
func getCgroupConfig(rl v1.ResourceList) *ResourceConfig {
	// TODO(vishh): Set CPU Quota if necessary.
	if rl == nil {
		return nil
	}
	var rc ResourceConfig
	if q, exists := rl[v1.ResourceMemory]; exists {
		// Memory is defined in bytes.
		val := q.Value()
		rc.Memory = &val
	}
	if q, exists := rl[v1.ResourceCPU]; exists {
		// CPU is defined in milli-cores.
		val := MilliCPUToShares(q.MilliValue())
		rc.CpuShares = &val
	}
	if q, exists := rl[pidlimit.PIDs]; exists {
		val := q.Value()
		rc.PidsLimit = &val
	}
	rc.HugePageLimit = HugePageLimits(rl)

	return &rc
}

b. qos level
Call path: containerManagerImpl.Start -> setupNode -> cm.qosContainerManager.Start

Start creates the cgroup directories for the BestEffort and Burstable classes, then starts a goroutine that runs UpdateCgroups once a minute to refresh the cgroup resource values.

Code path: pkg/kubelet/cm/qos_container_manager.go
func (m *qosContainerManagerImpl) Start(getNodeAllocatable func() v1.ResourceList, activePods ActivePodsFunc) error {
	cm := m.cgroupManager
	rootContainer := m.cgroupRoot
	if !cm.Exists(rootContainer) {
		return fmt.Errorf("root container %v doesn't exist", rootContainer)
	}

	// Top level for Qos containers are created only for Burstable
	// and Best Effort classes
	qosClasses := map[v1.PodQOSClass]CgroupName{
		v1.PodQOSBurstable:  NewCgroupName(rootContainer, strings.ToLower(string(v1.PodQOSBurstable))),
		v1.PodQOSBestEffort: NewCgroupName(rootContainer, strings.ToLower(string(v1.PodQOSBestEffort))),
	}

	// Create containers for both qos classes
	for qosClass, containerName := range qosClasses {
		resourceParameters := &ResourceConfig{}
		//for the BestEffort class, cpu.shares is always MinShares;
		//the Burstable class starts at 0 and is updated later by UpdateCgroups
		// the BestEffort QoS class has a statically configured minShares value
		if qosClass == v1.PodQOSBestEffort {
			minShares := uint64(MinShares)
			resourceParameters.CpuShares = &minShares
		}

		// containerConfig object stores the cgroup specifications
		containerConfig := &CgroupConfig{
			Name:               containerName,
			ResourceParameters: resourceParameters,
		}

		// for each enumerated huge page size, the qos tiers are unbounded
		m.setHugePagesUnbounded(containerConfig)

		//ask the cgroupManager whether it already exists; create it if not
		// check if it exists
		if !cm.Exists(containerName) {
			if err := cm.Create(containerConfig); err != nil {
				return fmt.Errorf("failed to create top level %v QOS cgroup : %v", qosClass, err)
			}
		} else {
			// to ensure we actually have the right state, we update the config on startup
			if err := cm.Update(containerConfig); err != nil {
				return fmt.Errorf("failed to update top level %v QOS cgroup : %v", qosClass, err)
			}
		}
	}
	// Store the top level qos container names
	m.qosContainersInfo = QOSContainersInfo{
		Guaranteed: rootContainer,
		Burstable:  qosClasses[v1.PodQOSBurstable],
		BestEffort: qosClasses[v1.PodQOSBestEffort],
	}
	m.getNodeAllocatable = getNodeAllocatable
	m.activePods = activePods

	//start a goroutine that runs UpdateCgroups once a minute to refresh the cgroup resource values
	// update qos cgroup tiers on startup and in periodic intervals
	// to ensure desired state is in sync with actual state.
	go wait.Until(func() {
		err := m.UpdateCgroups()
		if err != nil {
			klog.InfoS("Failed to reserve QoS requests", "err", err)
		}
	}, periodicQOSCgroupUpdateInterval, wait.NeverStop)

	return nil
}

UpdateCgroups is called from two places: the periodic goroutine started above, and syncPod when a pod is created. In both cases the goal is to fold the resources
requested by the pods into the corresponding QoS cgroups.

func (m *qosContainerManagerImpl) UpdateCgroups() error {
	m.Lock()
	defer m.Unlock()

	qosConfigs := map[v1.PodQOSClass]*CgroupConfig{
		v1.PodQOSGuaranteed: {
			Name:               m.qosContainersInfo.Guaranteed,
			ResourceParameters: &ResourceConfig{},
		},
		v1.PodQOSBurstable: {
			Name:               m.qosContainersInfo.Burstable,
			ResourceParameters: &ResourceConfig{},
		},
		v1.PodQOSBestEffort: {
			Name:               m.qosContainersInfo.BestEffort,
			ResourceParameters: &ResourceConfig{},
		},
	}

	//gather the CPU requests of the Burstable and BestEffort pods among all active pods
	// update the qos level cgroup settings for cpu shares
	if err := m.setCPUCgroupConfig(qosConfigs); err != nil {
		return err
	}

	// update the qos level cgroup settings for huge pages (ensure they remain unbounded)
	if err := m.setHugePagesConfig(qosConfigs); err != nil {
		return err
	}

	//a cgroup v2 feature, ignore it for now
	// update the qos level cgrougs v2 settings of memory qos if feature enabled
	if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.MemoryQoS) &&
		libcontainercgroups.IsCgroup2UnifiedMode() {
		m.setMemoryQoS(qosConfigs)
	}

	//if the QOSReserved feature gate is enabled, compute the memory limits for the Burstable and BestEffort QoS cgroups
	if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.QOSReserved) {
		for resource, percentReserve := range m.qosReserved {
			switch resource {
			case v1.ResourceMemory:
				m.setMemoryReserve(qosConfigs, percentReserve)
			}
		}

		updateSuccess := true
		for _, config := range qosConfigs {
			err := m.cgroupManager.Update(config)
			if err != nil {
				updateSuccess = false
			}
		}
		if updateSuccess {
			klog.V(4).InfoS("Updated QoS cgroup configuration")
			return nil
		}

		// If the resource can adjust the ResourceConfig to increase likelihood of
		// success, call the adjustment function here.  Otherwise, the Update() will
		// be called again with the same values.
		for resource, percentReserve := range m.qosReserved {
			switch resource {
			case v1.ResourceMemory:
				m.retrySetMemoryReserve(qosConfigs, percentReserve)
			}
		}
	}

	//finally, update the corresponding cgroups
	for _, config := range qosConfigs {
		err := m.cgroupManager.Update(config)
		if err != nil {
			klog.ErrorS(err, "Failed to update QoS cgroup configuration")
			return err
		}
	}

	klog.V(4).InfoS("Updated QoS cgroup configuration")
	return nil
}

func (m *qosContainerManagerImpl) setCPUCgroupConfig(configs map[v1.PodQOSClass]*CgroupConfig) error {
	pods := m.activePods()
	burstablePodCPURequest := int64(0)
	for i := range pods {
		pod := pods[i]
		//get the pod's QoS class
		qosClass := v1qos.GetPodQOS(pod)
		//only Burstable pods matter here
		if qosClass != v1.PodQOSBurstable {
			// we only care about the burstable qos tier
			continue
		}
		//accumulate the CPU requests
		req, _ := resource.PodRequestsAndLimits(pod)
		if request, found := req[v1.ResourceCPU]; found {
			burstablePodCPURequest += request.MilliValue()
		}
	}

	//BestEffort cpu.shares is always 2
	// make sure best effort is always 2 shares
	bestEffortCPUShares := uint64(MinShares)
	configs[v1.PodQOSBestEffort].ResourceParameters.CpuShares = &bestEffortCPUShares

	// set burstable shares based on current observe state
	burstableCPUShares := MilliCPUToShares(burstablePodCPURequest)
	configs[v1.PodQOSBurstable].ResourceParameters.CpuShares = &burstableCPUShares
	return nil
}

// setMemoryReserve sums the memory limits of all pods in a QOS class,
// calculates QOS class memory limits, and set those limits in the
// CgroupConfig for each QOS class.
func (m *qosContainerManagerImpl) setMemoryReserve(configs map[v1.PodQOSClass]*CgroupConfig, percentReserve int64) {
	qosMemoryRequests := m.getQoSMemoryRequests()

	//getNodeAllocatable is the function GetNodeAllocatableAbsolute
	resources := m.getNodeAllocatable()
	allocatableResource, ok := resources[v1.ResourceMemory]
	if !ok {
		klog.V(2).InfoS("Allocatable memory value could not be determined, not setting QoS memory limits")
		return
	}
	allocatable := allocatableResource.Value()
	if allocatable == 0 {
		klog.V(2).InfoS("Allocatable memory reported as 0, might be in standalone mode, not setting QoS memory limits")
		return
	}

	for qos, limits := range qosMemoryRequests {
		klog.V(2).InfoS("QoS pod memory limit", "qos", qos, "limits", limits, "percentReserve", percentReserve)
	}

	// Calculate QOS memory limits
	burstableLimit := allocatable - (qosMemoryRequests[v1.PodQOSGuaranteed] * percentReserve / 100)
	bestEffortLimit := burstableLimit - (qosMemoryRequests[v1.PodQOSBurstable] * percentReserve / 100)
	configs[v1.PodQOSBurstable].ResourceParameters.Memory = &burstableLimit
	configs[v1.PodQOSBestEffort].ResourceParameters.Memory = &bestEffortLimit
}

// getQoSMemoryRequests sums and returns the memory request of all pods for
// guaranteed and burstable qos classes.
func (m *qosContainerManagerImpl) getQoSMemoryRequests() map[v1.PodQOSClass]int64 {
	qosMemoryRequests := map[v1.PodQOSClass]int64{
		v1.PodQOSGuaranteed: 0,
		v1.PodQOSBurstable:  0,
	}

	// Sum the pod limits for pods in each QOS class
	pods := m.activePods()
	for _, pod := range pods {
		podMemoryRequest := int64(0)
		qosClass := v1qos.GetPodQOS(pod)
		if qosClass == v1.PodQOSBestEffort {
			// limits are not set for Best Effort pods
			continue
		}
		req, _ := resource.PodRequestsAndLimits(pod)
		if request, found := req[v1.ResourceMemory]; found {
			podMemoryRequest += request.Value()
		}
		qosMemoryRequests[qosClass] += podMemoryRequest
	}

	return qosMemoryRequests
}

// GetNodeAllocatableAbsolute returns the absolute value of Node Allocatable which is primarily useful for enforcement.
// Note that not all resources that are available on the node are included in the returned list of resources.
// Returns a ResourceList.
func (cm *containerManagerImpl) GetNodeAllocatableAbsolute() v1.ResourceList {
	return cm.getNodeAllocatableAbsoluteImpl(cm.capacity)
}

func (cm *containerManagerImpl) getNodeAllocatableAbsoluteImpl(capacity v1.ResourceList) v1.ResourceList {
	result := make(v1.ResourceList)
	for k, v := range capacity {
		value := v.DeepCopy()
		if cm.NodeConfig.SystemReserved != nil {
			value.Sub(cm.NodeConfig.SystemReserved[k])
		}
		if cm.NodeConfig.KubeReserved != nil {
			value.Sub(cm.NodeConfig.KubeReserved[k])
		}
		if value.Sign() < 0 {
			// Negative Allocatable resources don't make sense.
			value.Set(0)
		}
		result[k] = value
	}
	return result
}

c. pod level
When a pod is created, UpdateQOSCgroups is called first to refresh the QoS-level cgroups, then EnsureExists creates the pod-level cgroup.

//Excerpt from (kl *Kubelet) syncPod:
pcm := kl.containerManager.NewPodContainerManager()
if !pcm.Exists(pod) {
	//update the QoS level cgroups (e.g. the burstable cgroup)
	kl.containerManager.UpdateQOSCgroups()

	//create the pod's cgroup under kubepods
	pcm.EnsureExists(pod)
}

Code path: pkg/kubelet/cm/pod_container_manager.go
// EnsureExists takes a pod as argument and makes sure that
// pod cgroup exists if qos cgroup hierarchy flag is enabled.
// If the pod level container doesn't already exist it is created.
func (m *podContainerManagerImpl) EnsureExists(pod *v1.Pod) error {
	podContainerName, _ := m.GetPodContainerName(pod)
	// check if container already exist
	alreadyExists := m.Exists(pod)
	if !alreadyExists {
		enforceMemoryQoS := false
		if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.MemoryQoS) &&
			libcontainercgroups.IsCgroup2UnifiedMode() {
			enforceMemoryQoS = true
		}
		// Create the pod container
		containerConfig := &CgroupConfig{
			Name:               podContainerName,
			ResourceParameters: ResourceConfigForPod(pod, m.enforceCPULimits, m.cpuCFSQuotaPeriod, enforceMemoryQoS),
		}
		if m.podPidsLimit > 0 {
			containerConfig.ResourceParameters.PidsLimit = &m.podPidsLimit
		}
		if enforceMemoryQoS {
			klog.V(4).InfoS("MemoryQoS config for pod", "pod", klog.KObj(pod), "unified", containerConfig.ResourceParameters.Unified)
		}
		if err := m.cgroupManager.Create(containerConfig); err != nil {
			return fmt.Errorf("failed to create container for %v : %v", podContainerName, err)
		}
	}
	return nil
}

Code path: pkg/kubelet/cm/helpers_linux.go
// ResourceConfigForPod takes the input pod and outputs the cgroup resource config.
func ResourceConfigForPod(pod *v1.Pod, enforceCPULimits bool, cpuPeriod uint64, enforceMemoryQoS bool) *ResourceConfig {
	// sum requests and limits.
	reqs, limits := resource.PodRequestsAndLimits(pod)

	cpuRequests := int64(0)
	cpuLimits := int64(0)
	memoryLimits := int64(0)
	if request, found := reqs[v1.ResourceCPU]; found {
		cpuRequests = request.MilliValue()
	}
	if limit, found := limits[v1.ResourceCPU]; found {
		cpuLimits = limit.MilliValue()
	}
	if limit, found := limits[v1.ResourceMemory]; found {
		memoryLimits = limit.Value()
	}

	// convert to CFS values
	cpuShares := MilliCPUToShares(cpuRequests)
	cpuQuota := MilliCPUToQuota(cpuLimits, int64(cpuPeriod))

	// track if limits were applied for each resource.
	memoryLimitsDeclared := true
	cpuLimitsDeclared := true
	// map hugepage pagesize (bytes) to limits (bytes)
	hugePageLimits := map[int64]int64{}
	for _, container := range pod.Spec.Containers {
		if container.Resources.Limits.Cpu().IsZero() {
			cpuLimitsDeclared = false
		}
		if container.Resources.Limits.Memory().IsZero() {
			memoryLimitsDeclared = false
		}
		containerHugePageLimits := HugePageLimits(container.Resources.Requests)
		for k, v := range containerHugePageLimits {
			if value, exists := hugePageLimits[k]; exists {
				hugePageLimits[k] = value + v
			} else {
				hugePageLimits[k] = v
			}
		}
	}

	for _, container := range pod.Spec.InitContainers {
		if container.Resources.Limits.Cpu().IsZero() {
			cpuLimitsDeclared = false
		}
		if container.Resources.Limits.Memory().IsZero() {
			memoryLimitsDeclared = false
		}
		containerHugePageLimits := HugePageLimits(container.Resources.Requests)
		for k, v := range containerHugePageLimits {
			if value, exists := hugePageLimits[k]; !exists || v > value {
				hugePageLimits[k] = v
			}
		}
	}

	// quota is not capped when cfs quota is disabled
	if !enforceCPULimits {
		cpuQuota = int64(-1)
	}

	// determine the qos class
	qosClass := v1qos.GetPodQOS(pod)

	// build the result
	result := &ResourceConfig{}
	if qosClass == v1.PodQOSGuaranteed {
		result.CpuShares = &cpuShares
		result.CpuQuota = &cpuQuota
		result.CpuPeriod = &cpuPeriod
		result.Memory = &memoryLimits
	} else if qosClass == v1.PodQOSBurstable {
		result.CpuShares = &cpuShares
		if cpuLimitsDeclared {
			result.CpuQuota = &cpuQuota
			result.CpuPeriod = &cpuPeriod
		}
		if memoryLimitsDeclared {
			result.Memory = &memoryLimits
		}
	} else {
		shares := uint64(MinShares)
		result.CpuShares = &shares
	}
	result.HugePageLimit = hugePageLimits

	if enforceMemoryQoS {
		memoryMin := int64(0)
		if request, found := reqs[v1.ResourceMemory]; found {
			memoryMin = request.Value()
		}
		if memoryMin > 0 {
			result.Unified = map[string]string{
				MemoryMin: strconv.FormatInt(memoryMin, 10),
			}
		}
	}

	return result
}

d. container level
Call path: startContainer -> generateContainerConfig -> applyPlatformSpecificContainerConfig -> generateLinuxContainerConfig

In generateLinuxContainerConfig, the requests and limits configured by the user are translated into the container config, which is eventually passed to containerd to create the container and the container-level cgroup.

Code path: pkg/kubelet/kuberuntime/kuberuntime_container_linux.go
// generateLinuxContainerConfig generates linux container config for kubelet runtime v1.
func (m *kubeGenericRuntimeManager) generateLinuxContainerConfig(container *v1.Container, pod *v1.Pod, uid *int64, username string, nsTarget *kubecontainer.ContainerID, enforceMemoryQoS bool) *runtimeapi.LinuxContainerConfig {
	...
	// set linux container resources
	var cpuShares int64
	cpuRequest := container.Resources.Requests.Cpu()
	cpuLimit := container.Resources.Limits.Cpu()
	memoryLimit := container.Resources.Limits.Memory().Value()
	memoryRequest := container.Resources.Requests.Memory().Value()
	oomScoreAdj := int64(qos.GetContainerOOMScoreAdjust(pod, container,
		int64(m.machineInfo.MemoryCapacity)))
	// If request is not specified, but limit is, we want request to default to limit.
	// API server does this for new containers, but we repeat this logic in Kubelet
	// for containers running on existing Kubernetes clusters.
	if cpuRequest.IsZero() && !cpuLimit.IsZero() {
		cpuShares = milliCPUToShares(cpuLimit.MilliValue())
	} else {
		// if cpuRequest.Amount is nil, then milliCPUToShares will return the minimal number
		// of CPU shares.
		cpuShares = milliCPUToShares(cpuRequest.MilliValue())
	}
	lc.Resources.CpuShares = cpuShares
	if memoryLimit != 0 {
		lc.Resources.MemoryLimitInBytes = memoryLimit
	}
	// Set OOM score of the container based on qos policy. Processes in lower-priority pods should
	// be killed first if the system runs out of memory.
	lc.Resources.OomScoreAdj = oomScoreAdj

	if m.cpuCFSQuota {
		// if cpuLimit.Amount is nil, then the appropriate default value is returned
		// to allow full usage of cpu resource.
		cpuPeriod := int64(quotaPeriod)
		if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.CPUCFSQuotaPeriod) {
			cpuPeriod = int64(m.cpuCFSQuotaPeriod.Duration / time.Microsecond)
		}
		cpuQuota := milliCPUToQuota(cpuLimit.MilliValue(), cpuPeriod)
		lc.Resources.CpuQuota = cpuQuota
		lc.Resources.CpuPeriod = cpuPeriod
	}

	...

	return lc
}

References
https://github.com/kubernetes/design-proposals-archive/blob/main/node/node-allocatable.md
https://github.com/kubernetes/design-proposals-archive/blob/main/node/resource-qos.md
https://access.redhat.com/documentation/zh-cn/red_hat_enterprise_linux/7/html/resource_management_guide/sec-memory