Kubernetes version: 1.24
1. cgroup parameters
This article walks through the cgroup parameters that Kubernetes relies on, focusing on cgroup v1.
cpu.shares
Sets the relative share of CPU available to the processes in the cgroup. When the system is idle, the processes may use as much CPU as they like and are not constrained by this value; when the system is busy, it guarantees the minimum share of CPU the processes will get.
The value is relative and defaults to 1024, regardless of whether the machine has one core or many. The share a cgroup ultimately gets is: this cgroup's cpu.shares / the sum of cpu.shares of all competing cgroups. For example, on a single-core system, if cgroups A and B both keep the default 1024, processes in A and B each get 50% of the CPU; if a cgroup C with a value of 2048 is added, A gets 25%, B gets 25% and C gets 50%. The same arithmetic applies on multi-core systems.
Its most important role is to guarantee the minimum CPU available to a cgroup: if cgroup A is entitled to 50%, its processes are guaranteed 50% of the CPU no matter how busy the system is, and when the system is idle they may use up to 100%.
In Kubernetes, resources.requests.cpu specifies the minimum CPU a container needs; kubelet converts it into cpu.shares through the function MilliCPUToShares, roughly as follows:
cpu.shares = (resources.requests.cpu in milli-cores * 1024) / 1000
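A minimal, self-contained sketch of that conversion, assuming the constants kubelet uses (a floor of 2 shares, 1024 shares per CPU); the names below are illustrative rather than the exact ones in the kubelet source:

package main

import "fmt"

const (
    minShares     = 2    // kernel-enforced floor for cpu.shares
    sharesPerCPU  = 1024 // cpu.shares value corresponding to one full CPU
    milliCPUToCPU = 1000 // milli-cores per CPU
)

// milliCPUToShares scales a CPU request in milli-cores by 1024/1000 and
// clamps the result to the minimum; a missing request maps to the minimum.
func milliCPUToShares(milliCPU int64) uint64 {
    if milliCPU == 0 {
        return minShares
    }
    shares := (milliCPU * sharesPerCPU) / milliCPUToCPU
    if shares < minShares {
        return minShares
    }
    return uint64(shares)
}

func main() {
    fmt.Println(milliCPUToShares(500))  // 512  (requests.cpu: 500m)
    fmt.Println(milliCPUToShares(2000)) // 2048 (requests.cpu: 2)
    fmt.Println(milliCPUToShares(0))    // 2    (no request)
}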
cpu.cfs_period_us
Sets the length of the CFS accounting period, i.e. how often the cgroup's CPU quota is refilled. It acts like a time slice, is expressed in microseconds, and must be between 1ms and 1s (1000-1000000).
cpu.cfs_quota_us
Sets how much CPU time the cgroup may consume within one period. A value of -1 means the cgroup is not throttled; the minimum value is 1ms.
In Kubernetes, resources.limits.cpu specifies the maximum CPU a container may use; kubelet converts it into cpu.cfs_quota_us through the function MilliCPUToQuota. cpu.cfs_period_us can be set via a kubelet flag and defaults to 100ms. The conversion is roughly: cpu.cfs_quota_us = (resources.limits.cpu in milli-cores * cpu.cfs_period_us) / 1000
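A sketch of that conversion under the same caveats as above (illustrative names, simplified edge cases); the real MilliCPUToQuota returns 0 when no limit is set, and kubelet then leaves the cgroup unthrottled, which shows up as -1 in cpu.cfs_quota_us:

// milliCPUToQuota converts a CPU limit in milli-cores into a CFS quota for
// the given period (in microseconds), respecting the kernel's 1ms minimum.
func milliCPUToQuota(milliCPU, periodUs int64) int64 {
    const (
        milliCPUToCPU = 1000
        minQuotaUs    = 1000 // 1ms, the smallest quota the kernel accepts
    )
    if milliCPU == 0 {
        return 0 // no limit configured
    }
    quota := (milliCPU * periodUs) / milliCPUToCPU
    if quota < minQuotaUs {
        quota = minQuotaUs
    }
    return quota
}

// milliCPUToQuota(500, 100000)  == 50000  (limits.cpu: 500m)
// milliCPUToQuota(1000, 100000) == 100000 (limits.cpu: 1)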
memory.limit_in_bytes
Sets the maximum amount of memory the processes in the cgroup may use. Without a suffix the value is interpreted as bytes; suffixes such as K/M/G denote larger units. A value of -1 means the cgroup imposes no limit.
In Kubernetes, resources.requests.memory specifies the minimum memory a container needs, but cgroup v1 has no way to express a memory minimum, so the request is not written to any cgroup file; cgroup v2 does support it, and it can be enabled via the MemoryQoS feature gate.
resources.limits.memory specifies the maximum memory a container may use and is written to memory.limit_in_bytes.
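The memory side is just a unit conversion; a small sketch using the apimachinery resource package (assuming it is available on the module path):

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/resource"
)

func main() {
    // resources.limits.memory as written in a pod spec
    limit := resource.MustParse("128Mi")
    // the value kubelet ends up writing to memory.limit_in_bytes
    fmt.Println(limit.Value()) // 134217728
}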
2. Pod QoS classes
Based on the values of resources.requests and resources.limits, a pod falls into one of three classes:
a. Guaranteed: every container in the pod specifies both requests and limits, they are non-zero, and the requests equal the limits
b. Burstable: at least one container in the pod specifies a request or a limit, but the pod does not meet the Guaranteed criteria
c. BestEffort: no container in the pod specifies any request or limit
How the QoS classes are implemented underneath:
a. Processes of different QoS classes get different oom_score_adj values. This value feeds into the final oom_score; the higher the oom_score, the earlier the process is killed when an OOM occurs.
Guaranteed pods get oom_score_adj -997, Burstable pods get a value in the range 3-999, and BestEffort pods get 1000.
See pkg/kubelet/qos/policy.go:GetContainerOOMScoreAdjust for the details; a sketch of the policy follows item b below.
b. QoS is ultimately implemented with cgroups, and each class lives at a different level of the hierarchy: Guaranteed pods sit directly under ROOT/kubepods,
Burstable pods under ROOT/kubepods/kubepods-burstable, and BestEffort pods under ROOT/kubepods/kubepods-besteffort.
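As mentioned above, here is a minimal sketch of the oom_score_adj policy; the flattened signature and the qosClass string parameter are illustrative (the real GetContainerOOMScoreAdjust takes the pod, the container and the node's memory capacity):

// oomScoreAdjustFor returns the oom_score_adj for a container of the given
// QoS class; memoryRequestBytes and memoryCapacityBytes are in bytes.
func oomScoreAdjustFor(qosClass string, memoryRequestBytes, memoryCapacityBytes int64) int {
    const (
        guaranteedOOMScoreAdj = -997
        besteffortOOMScoreAdj = 1000
    )
    switch qosClass {
    case "Guaranteed":
        return guaranteedOOMScoreAdj
    case "BestEffort":
        return besteffortOOMScoreAdj
    }
    // Burstable: the more memory the pod requests relative to node capacity,
    // the lower (safer) its score; the result is clamped into 3..999 so it
    // never collides with the Guaranteed or BestEffort values.
    adj := 1000 - (1000*memoryRequestBytes)/memoryCapacityBytes
    if adj < 1000+guaranteedOOMScoreAdj { // below 3
        return 1000 + guaranteedOOMScoreAdj
    }
    if adj >= besteffortOOMScoreAdj { // would collide with BestEffort
        return besteffortOOMScoreAdj - 1
    }
    return int(adj)
}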
Notes:
The QoS class cannot be set in the pod YAML; it is computed automatically, see pkg/apis/core/helper/qos/qos.go:GetPodQOS (a rough approximation follows this list)
If only a limit is specified and no request, the request defaults to the value of the limit
If both request and limit are specified, the request must not be greater than the limit
kube-scheduler schedules only on requests and never looks at limits
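A rough approximation of that classification on a flattened view of a pod's containers; the containerResources type and qosClassOf are hypothetical helpers, not the real API (init containers and extended resources are ignored):

type containerResources struct {
    Requests map[string]int64 // "cpu" in milli-cores, "memory" in bytes
    Limits   map[string]int64
}

func qosClassOf(containers []containerResources) string {
    anySet := false
    guaranteed := true
    for _, c := range containers {
        if len(c.Requests) > 0 || len(c.Limits) > 0 {
            anySet = true
        }
        for _, res := range []string{"cpu", "memory"} {
            // Guaranteed requires a non-zero cpu and memory limit in every
            // container, with the request equal to the limit (an unset
            // request is defaulted to the limit by the API server).
            if c.Limits[res] == 0 {
                guaranteed = false
                continue
            }
            if req, ok := c.Requests[res]; ok && req != c.Limits[res] {
                guaranteed = false
            }
        }
    }
    switch {
    case !anySet:
        return "BestEffort"
    case guaranteed:
        return "Guaranteed"
    default:
        return "Burstable"
    }
}

The class the API server actually computed can be read back with kubectl get pod <name> -o jsonpath='{.status.qosClass}'.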
3. cgroup drivers
Two cgroup drivers are supported: cgroupfs and systemd. The former writes the cgroup files directly; the latter manipulates cgroups indirectly through systemd's interfaces.
With the systemd driver, cgroup directory names carry a .slice suffix; see pkg/kubelet/cm/cgroup_manager_linux.go:ToSystemd.
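A hedged sketch of that name mapping (the function below is illustrative, not the actual ToSystemd): a cgroupfs-style name such as ["kubepods", "burstable", "pod<uid>"] becomes nested .slice directories whose names repeat all ancestors joined by "-":

func toSystemdPath(components []string) string {
    if len(components) == 0 {
        return "/"
    }
    path, prefix := "", ""
    for _, c := range components {
        if prefix != "" {
            prefix += "-"
        }
        prefix += c
        path += "/" + prefix + ".slice"
    }
    return path
}

// toSystemdPath([]string{"kubepods", "burstable", "pod123"}) returns
// "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod123.slice"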
4. kubelet flags related to cgroups
a. --cgroups-per-qos: when enabled, kubelet creates a cgroup hierarchy for the QoS classes and for every pod; defaults to true
b. --cgroup-root: the root cgroup to use, defaults to /, i.e. /sys/fs/cgroup/; when --cgroups-per-qos is enabled, kubepods is appended automatically, so the effective root becomes /kubepods
c. --enforce-node-allocatable: which reservations kubelet should enforce; possible values are none, pods, system-reserved and kube-reserved, and the default is pods.
If system-reserved is included, --system-reserved-cgroup must also be set;
if kube-reserved is included, --kube-reserved-cgroup must also be set
d. --system-reserved: resources reserved for system daemons, e.g. cpu=200m,memory=500Mi,ephemeral-storage=1Gi
e. --kube-reserved: resources reserved for Kubernetes components, e.g. cpu=200m,memory=500Mi,ephemeral-storage=1Gi
f. --system-reserved-cgroup: the absolute path of the cgroup used by system daemons; the values from --system-reserved are written into this cgroup to cap what system daemons can use.
For example, if it is set to /sys and the systemd driver is used, the user must create /sys/fs/cgroup/sys.slice beforehand
g. --kube-reserved-cgroup: the absolute path of the cgroup used by Kubernetes components; the values from --kube-reserved are written into this cgroup to cap what those components can use.
For example, if it is set to /kube and the systemd driver is used, the user must create /sys/fs/cgroup/kube.slice beforehand
h. --system-cgroups: the absolute path of the cgroup that non-kernel system processes should be placed in, best kept under the --system-reserved-cgroup hierarchy, e.g. /sys.slice/system; the path is created automatically.
See pkg/kubelet/cm/container_manager_linux.go:ensureSystemCgroups, which tries to move every process that is neither a kernel thread nor PID 1 into this cgroup;
on systems running systemd, however, every process is either a kernel thread or a child of PID 1 that systemd has already placed into its own cgroup, so in practice nothing gets moved even when this flag is set
i. --kubelet-cgroups: the absolute path of the cgroup for the kubelet process itself, best kept under the --kube-reserved-cgroup hierarchy, e.g. /kube.slice/kubelet;
the path is created automatically, see pkg/kubelet/cm/container_manager_linux.go:ensureProcessInContainerWithOOMScore, which also sets the kubelet's
oom_score_adj to -999
j. --qos-reserved: the percentage of requested resources to reserve for higher-priority QoS classes; currently only memory is supported. For example, with memory=100% and 1G of allocatable memory, creating a Guaranteed pod with a 100M limit reserves 100M for it,
so memory.limit_in_bytes of the Burstable and BestEffort cgroups is set to 900M; if a Burstable pod with a 200M limit is then created, memory.limit_in_bytes of the BestEffort cgroup drops to 700M (see the sketch below)
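A hedged sketch of that arithmetic (the real logic is setMemoryReserve, shown in the source walkthrough later; the function name and byte values here are illustrative):

// qosReservedLimits derives the memory.limit_in_bytes of the Burstable and
// BestEffort QoS cgroups from the node allocatable and the summed memory
// requests of the higher classes; all values are in bytes.
func qosReservedLimits(allocatable, guaranteedReq, burstableReq, percent int64) (burstableLimit, bestEffortLimit int64) {
    burstableLimit = allocatable - guaranteedReq*percent/100
    bestEffortLimit = burstableLimit - burstableReq*percent/100
    return
}

// With the example above (1G allocatable, memory=100%):
//   qosReservedLimits(1<<30, 100<<20, 0, 100)       // burstable and besteffort capped at ~900M
//   qosReservedLimits(1<<30, 100<<20, 200<<20, 100) // besteffort drops to ~700M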
5. The Kubernetes cgroup hierarchy
After kubelet starts, it creates a kubepods directory under the root given by --cgroup-root, e.g. /sys/fs/cgroup/cpu/kubepods, and writes the node's allocatable resources into the cgroup files underneath it (cpu.shares and so on). Every pod created afterwards lives under this kubepods directory, which is how the total resources available to pods are capped. Inside kubepods the pods are split by QoS class: Guaranteed pods get their cgroups directly under kubepods; Burstable pods go into a kubepods-burstable.slice subdirectory; BestEffort pods go into a kubepods-besteffort.slice subdirectory.
Now create one pod of each QoS class and look at the resulting cgroup hierarchy.
a. A pod whose requests equal its limits, i.e. a Guaranteed pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo1
spec:
  replicas: 1
  selector:          # apps/v1 Deployments require a selector matching the template labels
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeName: master
      containers:
      - image: nginx:1.14
        imagePullPolicy: IfNotPresent
        name: nginx
        resources:
          requests:
            memory: "128Mi"
            cpu: "500m"
          limits:
            memory: "128Mi"
            cpu: "500m"
b. A pod whose requests are lower than its limits, i.e. a Burstable pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeName: master
      containers:
      - image: nginx:1.14
        imagePullPolicy: IfNotPresent
        name: nginx
        resources:
          requests:
            memory: "128Mi"
            cpu: "500m"
          limits:
            memory: "256Mi"
            cpu: "1000m"
c. A pod with no requests and no limits, i.e. a BestEffort pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeName: master
      containers:
      - image: nginx:1.14
        imagePullPolicy: IfNotPresent
        name: nginx
Below is the memory cgroup hierarchy. kubepods.slice contains all pods on this node, and its memory.limit_in_bytes of roughly 2809M caps the memory those pods can use. The subdirectories under kubepods.slice hold the pods of the three QoS classes (a Guaranteed pod's directory sits directly under kubepods.slice); there is currently only one Guaranteed pod, and with more of them there would be more such directories. Note that a pod's containers do not live under the pod's own directory but under system.slice/containerd.service.
root@master:/root# tree /sys/fs/cgroup/memory
/sys/fs/cgroup/memory/
├── memory.limit_in_bytes //9223372036854771712
├── kubepods.slice
│ ├── memory.limit_in_bytes //2946347008 bytes / 2877292Ki / ~2809M
│ ├── kubepods-besteffort.slice
│ │ ├── memory.limit_in_bytes //9223372036854771712
│ │ ├── kubepods-besteffort-podde4983ac-ff0c-40be-8472-8b6674593aa3.slice //BestEffort pod
│ │ │ ├── memory.limit_in_bytes //9223372036854771712
│ │ │ └── tasks
│ │ └── tasks
│ ├── kubepods-burstable.slice
│ │ ├── memory.limit_in_bytes //9223372036854771712, the maximum, i.e. no memory limit at the QoS level
│ │ ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice //Burstable pod
│ │ │ ├── memory.limit_in_bytes //268435456/256M
│ │ │ └── tasks
│ │ └── tasks
│ ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice //Guaranteed pod
│ │ ├── memory.limit_in_bytes //134217728/128M
│ │ └── tasks
│ └── tasks
├── kube.slice
│ ├── memory.limit_in_bytes //104857600/100M, 100M reserved for Kubernetes components
│ ├── kubelet
│ │ ├── memory.limit_in_bytes //9223372036854771712
│ │ └── tasks
│ └── tasks
├── sys.slice
│ ├── memory.limit_in_bytes //104857600/100M, 100M reserved for system processes
│ └── tasks
├── system.slice
│ ├── memory.limit_in_bytes //9223372036854771712
│ ├── containerd.service
│ │ ├── memory.limit_in_bytes //9223372036854771712
│ │ ├── kubepods-besteffort-podde4983ac_ff0c_40be_8472_8b6674593aa3.slice:cri-containerd:5a323896aa0db2f15c9f82145cd38851783d08d8bf132f3ed4a7613a3830f71a
│ │ │ ├── memory.limit_in_bytes //9223372036854771712
│ │ │ └── tasks
│ │ ├── kubepods-besteffort-podde4983ac_ff0c_40be_8472_8b6674593aa3.slice:cri-containerd:e6803695024464a3365721812dcff0347c40e162b8142244a527da7b785f215c
│ │ │ ├── memory.limit_in_bytes //9223372036854771712
│ │ │ └── tasks
│ │ ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice:cri-containerd:3c6a7115e688913d0a6d382607f0c1a9b5ecf58d4ee33c9c24e640dc33b80acc
│ │ │ ├── memory.limit_in_bytes //268435456/256M
│ │ │ └── tasks
│ │ ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice:cri-containerd:67e2b0336ed2af44875ad7b1fb9c35bae335673cf20a2a1d8331b85d4bea4d95
│ │ │ ├── memory.limit_in_bytes //9223372036854771712
│ │ │ └── tasks
│ │ ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice:cri-containerd:836a0a6aa460663b9a4dc8961dd55da11ae090c9e76705f81e9c7d43060423c3
│ │ │ ├── memory.limit_in_bytes //9223372036854771712
│ │ │ └── tasks
│ │ ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice:cri-containerd:9bbc1d7134d322e988ace0cbb4fc75f44184f4e0f24f1c0228be7eed6ec6f659
│ │ │ ├── memory.limit_in_bytes //134217728/128M
│ │ │ └── tasks
│ └── tasks
├── tasks
Below is the CPU cgroup hierarchy; the directory structure mirrors the memory hierarchy, and cpu.cfs_period_us is 100000 at every level.
root@master:/root# tree /sys/fs/cgroup/cpu
/sys/fs/cgroup/cpu/
├── cpu.cfs_period_us //100000
├── cpu.cfs_quota_us //-1
├── cpu.shares //1024
├── kubepods.slice
│ ├── cpu.cfs_period_us
│ ├── cpu.cfs_quota_us //-1
│ ├── cpu.shares //7168
│ ├── kubepods-besteffort.slice
│ │ ├── cpu.cfs_period_us
│ │ ├── cpu.cfs_quota_us //-1
│ │ ├── cpu.shares //2
│ │ ├── kubepods-besteffort-podde4983ac-ff0c-40be-8472-8b6674593aa3.slice
│ │ │ ├── cpu.cfs_period_us
│ │ │ ├── cpu.cfs_quota_us
│ │ │ ├── cpu.shares //2
│ │ │ └── tasks
│ │ └── tasks
│ ├── kubepods-burstable.slice
│ │ ├── cpu.cfs_period_us
│ │ ├── cpu.cfs_quota_us //-1
│ │ ├── cpu.shares //1546
│ │ ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice
│ │ │ ├── cpu.cfs_period_us
│ │ │ ├── cpu.cfs_quota_us //100000
│ │ │ ├── cpu.shares //512
│ │ │ └── tasks
│ │ └── tasks
│ ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice
│ │ ├── cpu.cfs_period_us
│ │ ├── cpu.cfs_quota_us //50000
│ │ ├── cpu.shares //512
│ │ └── tasks
│ └── tasks
├── kube.slice
│ ├── cpu.cfs_period_us
│ ├── cpu.cfs_quota_us //-1
│ ├── cpu.shares //512
│ ├── kubelet
│ │ ├── cpu.cfs_period_us
│ │ ├── cpu.cfs_quota_us //-1
│ │ ├── cpu.shares //1024
│ │ └── tasks
│ └── tasks
├── sys.slice
│ ├── cpu.cfs_period_us
│ ├── cpu.cfs_quota_us //-1
│ ├── cpu.shares //512
│ └── tasks
├── system.slice
│ ├── cpu.cfs_period_us
│ ├── cpu.cfs_quota_us //-1
│ ├── cpu.shares //1024
│ ├── containerd.service
│ │ ├── cpu.cfs_period_us
│ │ ├── cpu.cfs_quota_us //-1
│ │ ├── cpu.shares //1024
│ │ ├── kubepods-besteffort-podde4983ac_ff0c_40be_8472_8b6674593aa3.slice:cri-containerd:5a323896aa0db2f15c9f82145cd38851783d08d8bf132f3ed4a7613a3830f71a
│ │ │ ├── cpu.cfs_period_us
│ │ │ ├── cpu.cfs_quota_us //-1
│ │ │ ├── cpu.shares //2
│ │ │ └── tasks
│ │ ├── kubepods-besteffort-podde4983ac_ff0c_40be_8472_8b6674593aa3.slice:cri-containerd:e6803695024464a3365721812dcff0347c40e162b8142244a527da7b785f215c
│ │ │ ├── cpu.cfs_period_us
│ │ │ ├── cpu.cfs_quota_us //-1
│ │ │ ├── cpu.shares //2
│ │ │ └── tasks
│ │ ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice:cri-containerd:3c6a7115e688913d0a6d382607f0c1a9b5ecf58d4ee33c9c24e640dc33b80acc
│ │ │ ├── cpu.cfs_period_us
│ │ │ ├── cpu.cfs_quota_us //100000
│ │ │ ├── cpu.shares //512
│ │ │ └── tasks
│ │ ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice:cri-containerd:67e2b0336ed2af44875ad7b1fb9c35bae335673cf20a2a1d8331b85d4bea4d95
│ │ │ ├── cpu.cfs_period_us
│ │ │ ├── cpu.cfs_quota_us //-1
│ │ │ ├── cpu.shares //2
│ │ │ └── tasks
│ │ ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice:cri-containerd:836a0a6aa460663b9a4dc8961dd55da11ae090c9e76705f81e9c7d43060423c3
│ │ │ ├── cpu.cfs_period_us
│ │ │ ├── cpu.cfs_quota_us //-1
│ │ │ ├── cpu.shares //2
│ │ │ └── tasks
│ │ ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice:cri-containerd:9bbc1d7134d322e988ace0cbb4fc75f44184f4e0f24f1c0228be7eed6ec6f659
│ │ │ ├── cpu.cfs_period_us
│ │ │ ├── cpu.cfs_quota_us //50000
│ │ │ ├── cpu.shares //512
│ │ │ └── tasks
│ └── tasks
├── tasks
6. How the values are computed
Based on the hierarchy above, the cgroups Kubernetes manages fall into four levels: node level, QoS level, pod level and container level. The following shows how the values at each level are computed.
a. node level
The node level caps the total resources pods can use, so that unbounded pod resource consumption cannot starve other processes on the node and destabilize it. This relies on the node allocatable mechanism, which a later article will cover in detail; here we only look at how the node-level values are computed.
The node's total resources are its capacity; kube-reserved and system-reserved are the reservations given by --kube-reserved and --system-reserved. If --enforce-node-allocatable includes pods, the node-level budget is capacity - kube-reserved - system-reserved; if it does not include pods, the node-level budget is simply capacity.
kubepods.slice/cpu.shares = capacity(cpu) - kube-reserved(cpu) - system-reserved(cpu) //CPU is only translated into cpu.shares; no quota is set at this level
kubepods.slice/memory.limit_in_bytes = capacity(memory) - kube-reserved(memory) - system-reserved(memory)
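To make this concrete, a sketch with made-up numbers (8 CPUs / 4Gi capacity, 500m / 512Mi reserved for each of kube and system, --enforce-node-allocatable=pods); none of these values are taken from the node used elsewhere in this article:

func nodeLevelValues() (cpuShares int64, memoryLimitBytes int64) {
    capacityMilliCPU, capacityMemory := int64(8000), int64(4<<30)
    kubeReservedMilliCPU, kubeReservedMemory := int64(500), int64(512<<20)
    systemReservedMilliCPU, systemReservedMemory := int64(500), int64(512<<20)

    allocatableMilliCPU := capacityMilliCPU - kubeReservedMilliCPU - systemReservedMilliCPU
    allocatableMemory := capacityMemory - kubeReservedMemory - systemReservedMemory

    cpuShares = allocatableMilliCPU * 1024 / 1000 // 7000m -> 7168, written to kubepods.slice/cpu.shares
    memoryLimitBytes = allocatableMemory          // written to kubepods.slice/memory.limit_in_bytes
    return
}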
b. qos level
The three QoS classes are handled differently at this level. Guaranteed pods have requests equal to limits and sit directly under kubepods, so there is no separate Guaranteed QoS cgroup to compute. The memory values depend on --qos-reserved: if it is not set, no memory limit is applied at this level; the formulas below assume --qos-reserved is set.
Burstable:
kubepods.slice/kubepods-burstable.slice/cpu.shares = sum of requests[cpu] of all Burstable pods
kubepods.slice/kubepods-burstable.slice/memory.limit_in_bytes = kubepods.slice/memory.limit_in_bytes - (sum of requests[memory] of all Guaranteed pods) * (reservePercent / 100)
BestEffort:
kubepods.slice/kubepods-besteffort.slice/cpu.shares = 2
kubepods.slice/kubepods-besteffort.slice/memory.limit_in_bytes = kubepods.slice/memory.limit_in_bytes - (sum of requests[memory] of all Guaranteed and Burstable pods) * (reservePercent / 100)
c. pod level
The per-class formulas are (see the worked sketch after this list):
Guaranteed:
kubepods.slice/kubepods-pod<UID>.slice/cpu.shares = sum of requests[cpu] of all containers
kubepods.slice/kubepods-pod<UID>.slice/cpu.cfs_period_us = 100000
kubepods.slice/kubepods-pod<UID>.slice/cpu.cfs_quota_us = sum of limits[cpu] of all containers, converted with MilliCPUToQuota
kubepods.slice/kubepods-pod<UID>.slice/memory.limit_in_bytes = sum of limits[memory] of all containers
Burstable:
kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/cpu.shares = sum of requests[cpu] of all containers
kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/cpu.cfs_period_us = 100000 (set only when every container declares a CPU limit)
kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/cpu.cfs_quota_us = sum of limits[cpu] of all containers (set only when every container declares a CPU limit)
kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/memory.limit_in_bytes = sum of limits[memory] of all containers (set only when every container declares a memory limit)
BestEffort:
kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<UID>.slice/cpu.shares = 2
kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<UID>.slice/cpu.cfs_period_us = 100000 (the default)
kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<UID>.slice/cpu.cfs_quota_us = -1 (unthrottled)
kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<UID>.slice/memory.limit_in_bytes = 9223372036854771712 (no limit)
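To connect these formulas with the values in the trees of section 5, here is a hedged sketch for a single-container pod (the struct and function are illustrative, not the kubelet's ResourceConfig types):

type podCgroupValues struct {
    CPUShares   int64
    CPUQuotaUs  int64 // -1 means unthrottled
    MemoryLimit int64 // 9223372036854771712 means unlimited
}

// podLevelValues applies the pod-level formulas above; cpuReqMilli and
// cpuLimMilli are milli-cores, memLimit is bytes, a zero limit means "not
// set", and periodUs is cpu.cfs_period_us (normally 100000).
func podLevelValues(qos string, cpuReqMilli, cpuLimMilli, memLimit, periodUs int64) podCgroupValues {
    const noMemLimit = 9223372036854771712
    v := podCgroupValues{CPUShares: 2, CPUQuotaUs: -1, MemoryLimit: noMemLimit}
    if qos == "BestEffort" {
        return v
    }
    v.CPUShares = cpuReqMilli * 1024 / 1000
    if cpuLimMilli > 0 {
        v.CPUQuotaUs = cpuLimMilli * periodUs / 1000
    }
    if memLimit > 0 {
        v.MemoryLimit = memLimit
    }
    return v
}

// Guaranteed pod (500m/500m, 128Mi):           shares 512, quota 50000,  memory 134217728
// Burstable pod  (500m req, 1 lim, 256Mi lim): shares 512, quota 100000, memory 268435456
// BestEffort pod (nothing set):                shares 2,   quota -1,     memory 9223372036854771712
// These match the pod directories in the cpu and memory trees above.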
d. container level
The per-class formulas are:
Guaranteed:
system.slice/containerd.service/kubepods-pod<UID>.slice:cri-containerd:<container-id>/cpu.shares = request[cpu] of the container
system.slice/containerd.service/kubepods-pod<UID>.slice:cri-containerd:<container-id>/cpu.cfs_period_us = 100000
system.slice/containerd.service/kubepods-pod<UID>.slice:cri-containerd:<container-id>/cpu.cfs_quota_us = limit[cpu] of the container, converted with MilliCPUToQuota
system.slice/containerd.service/kubepods-pod<UID>.slice:cri-containerd:<container-id>/memory.limit_in_bytes = limit[memory] of the container
Burstable:
system.slice/containerd.service/kubepods-burstable-pod<UID>.slice:cri-containerd:<container-id>/cpu.shares = request[cpu] of the container
system.slice/containerd.service/kubepods-burstable-pod<UID>.slice:cri-containerd:<container-id>/cpu.cfs_period_us = 100000
system.slice/containerd.service/kubepods-burstable-pod<UID>.slice:cri-containerd:<container-id>/cpu.cfs_quota_us = limit[cpu] of the container, converted with MilliCPUToQuota
system.slice/containerd.service/kubepods-burstable-pod<UID>.slice:cri-containerd:<container-id>/memory.limit_in_bytes = limit[memory] of the container
BestEffort:
system.slice/containerd.service/kubepods-besteffort-pod<UID>.slice:cri-containerd:<container-id>/cpu.shares = 2
system.slice/containerd.service/kubepods-besteffort-pod<UID>.slice:cri-containerd:<container-id>/cpu.cfs_period_us = 100000
system.slice/containerd.service/kubepods-besteffort-pod<UID>.slice:cri-containerd:<container-id>/cpu.cfs_quota_us = -1
system.slice/containerd.service/kubepods-besteffort-pod<UID>.slice:cri-containerd:<container-id>/memory.limit_in_bytes = 9223372036854771712
7. Source code walkthrough
This section looks at when the cgroups of each level are created and updated.
a. node level
Call path: containerManagerImpl.Start -> setupNode -> createNodeAllocatableCgroups
//Source: pkg/kubelet/cm/node_container_manager.go
//createNodeAllocatableCgroups creates Node Allocatable Cgroup when CgroupsPerQOS flag is specified as true
func (cm *containerManagerImpl) createNodeAllocatableCgroups() error {
//Start from the node's capacity
nodeAllocatable := cm.internalCapacity
// Use Node Allocatable limits instead of capacity if the user requested enforcing node allocatable.
nc := cm.NodeConfig.NodeAllocatableConfig
//If --cgroups-per-qos is true and --enforce-node-allocatable includes pods, subtract the resources reserved by --system-reserved and --kube-reserved
if cm.CgroupsPerQOS && nc.EnforceNodeAllocatable.Has(kubetypes.NodeAllocatableEnforcementKey) {
nodeAllocatable = cm.getNodeAllocatableInternalAbsolute()
}
cgroupConfig := &CgroupConfig{
Name: cm.cgroupRoot,
// The default limits for cpu shares can be very low which can lead to CPU starvation for pods.
ResourceParameters: getCgroupConfig(nodeAllocatable),
}
//Ask the cgroupManager whether the cgroup already exists
if cm.cgroupManager.Exists(cgroupConfig.Name) {
return nil
}
//If not, create the node-level cgroup, e.g. /sys/fs/cgroup/kubepods.slice
if err := cm.cgroupManager.Create(cgroupConfig); err != nil {
klog.ErrorS(err, "Failed to create cgroup", "cgroupName", cm.cgroupRoot)
return err
}
return nil
}
// getNodeAllocatableInternalAbsolute is similar to getNodeAllocatableAbsolute except that
// it also includes internal resources (currently process IDs). It is intended for setting
// up top level cgroups only.
func (cm *containerManagerImpl) getNodeAllocatableInternalAbsolute() v1.ResourceList {
return cm.getNodeAllocatableAbsoluteImpl(cm.internalCapacity)
}
func (cm *containerManagerImpl) getNodeAllocatableAbsoluteImpl(capacity v1.ResourceList) v1.ResourceList {
result := make(v1.ResourceList)
for k, v := range capacity {
value := v.DeepCopy()
if cm.NodeConfig.SystemReserved != nil {
value.Sub(cm.NodeConfig.SystemReserved[k])
}
if cm.NodeConfig.KubeReserved != nil {
value.Sub(cm.NodeConfig.KubeReserved[k])
}
if value.Sign() < 0 {
// Negative Allocatable resources don't make sense.
value.Set(0)
}
result[k] = value
}
return result
}
// getCgroupConfig returns a ResourceConfig object that can be used to create or update cgroups via CgroupManager interface.
func getCgroupConfig(rl v1.ResourceList) *ResourceConfig {
// TODO(vishh): Set CPU Quota if necessary.
if rl == nil {
return nil
}
var rc ResourceConfig
if q, exists := rl[v1.ResourceMemory]; exists {
// Memory is defined in bytes.
val := q.Value()
rc.Memory = &val
}
if q, exists := rl[v1.ResourceCPU]; exists {
// CPU is defined in milli-cores.
val := MilliCPUToShares(q.MilliValue())
rc.CpuShares = &val
}
if q, exists := rl[pidlimit.PIDs]; exists {
val := q.Value()
rc.PidsLimit = &val
}
rc.HugePageLimit = HugePageLimits(rl)
return &rc
}
b. qos level
Call path: containerManagerImpl.Start -> setupNode -> cm.qosContainerManager.Start
Start creates the cgroup directories for the BestEffort and Burstable classes and launches a goroutine that runs UpdateCgroups once a minute to refresh the cgroup values.
Source: pkg/kubelet/cm/qos_container_manager.go
func (m *qosContainerManagerImpl) Start(getNodeAllocatable func() v1.ResourceList, activePods ActivePodsFunc) error {
cm := m.cgroupManager
rootContainer := m.cgroupRoot
if !cm.Exists(rootContainer) {
return fmt.Errorf("root container %v doesn't exist", rootContainer)
}
// Top level for Qos containers are created only for Burstable
// and Best Effort classes
qosClasses := map[v1.PodQOSClass]CgroupName{
v1.PodQOSBurstable: NewCgroupName(rootContainer, strings.ToLower(string(v1.PodQOSBurstable))),
v1.PodQOSBestEffort: NewCgroupName(rootContainer, strings.ToLower(string(v1.PodQOSBestEffort))),
}
// Create containers for both qos classes
for qosClass, containerName := range qosClasses {
resourceParameters := &ResourceConfig{}
//For the BestEffort class cpu.shares is always MinShares;
//the Burstable class starts at 0 and is updated later by UpdateCgroups
// the BestEffort QoS class has a statically configured minShares value
if qosClass == v1.PodQOSBestEffort {
minShares := uint64(MinShares)
resourceParameters.CpuShares = &minShares
}
// containerConfig object stores the cgroup specifications
containerConfig := &CgroupConfig{
Name: containerName,
ResourceParameters: resourceParameters,
}
// for each enumerated huge page size, the qos tiers are unbounded
m.setHugePagesUnbounded(containerConfig)
//Check with the cgroupManager whether it already exists; create it if not
// check if it exists
if !cm.Exists(containerName) {
if err := cm.Create(containerConfig); err != nil {
return fmt.Errorf("failed to create top level %v QOS cgroup : %v", qosClass, err)
}
} else {
// to ensure we actually have the right state, we update the config on startup
if err := cm.Update(containerConfig); err != nil {
return fmt.Errorf("failed to update top level %v QOS cgroup : %v", qosClass, err)
}
}
}
// Store the top level qos container names
m.qosContainersInfo = QOSContainersInfo{
Guaranteed: rootContainer,
Burstable: qosClasses[v1.PodQOSBurstable],
BestEffort: qosClasses[v1.PodQOSBestEffort],
}
m.getNodeAllocatable = getNodeAllocatable
m.activePods = activePods
//Launch a goroutine that runs UpdateCgroups once a minute to refresh the cgroup values
// update qos cgroup tiers on startup and in periodic intervals
// to ensure desired state is in sync with actual state.
go wait.Until(func() {
err := m.UpdateCgroups()
if err != nil {
klog.InfoS("Failed to reserve QoS requests", "err", err)
}
}, periodicQOSCgroupUpdateInterval, wait.NeverStop)
return nil
}
UpdateCgroups is called from two places: the goroutine started above, and syncPod when a pod is created. In both cases the goal is to aggregate the resources
requested by the pods into the corresponding QoS cgroups.
func (m *qosContainerManagerImpl) UpdateCgroups() error {
m.Lock()
defer m.Unlock()
qosConfigs := map[v1.PodQOSClass]*CgroupConfig{
v1.PodQOSGuaranteed: {
Name: m.qosContainersInfo.Guaranteed,
ResourceParameters: &ResourceConfig{},
},
v1.PodQOSBurstable: {
Name: m.qosContainersInfo.Burstable,
ResourceParameters: &ResourceConfig{},
},
v1.PodQOSBestEffort: {
Name: m.qosContainersInfo.BestEffort,
ResourceParameters: &ResourceConfig{},
},
}
//Gather the CPU requests of the Burstable and BestEffort pods among all active pods
// update the qos level cgroup settings for cpu shares
if err := m.setCPUCgroupConfig(qosConfigs); err != nil {
return err
}
// update the qos level cgroup settings for huge pages (ensure they remain unbounded)
if err := m.setHugePagesConfig(qosConfigs); err != nil {
return err
}
//cgroup v2 only feature, ignore for now
// update the qos level cgrougs v2 settings of memory qos if feature enabled
if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.MemoryQoS) &&
libcontainercgroups.IsCgroup2UnifiedMode() {
m.setMemoryQoS(qosConfigs)
}
//If the QOSReserved feature gate is enabled, derive the memory limits for the Burstable and BestEffort QoS cgroups
if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.QOSReserved) {
for resource, percentReserve := range m.qosReserved {
switch resource {
case v1.ResourceMemory:
m.setMemoryReserve(qosConfigs, percentReserve)
}
}
updateSuccess := true
for _, config := range qosConfigs {
err := m.cgroupManager.Update(config)
if err != nil {
updateSuccess = false
}
}
if updateSuccess {
klog.V(4).InfoS("Updated QoS cgroup configuration")
return nil
}
// If the resource can adjust the ResourceConfig to increase likelihood of
// success, call the adjustment function here. Otherwise, the Update() will
// be called again with the same values.
for resource, percentReserve := range m.qosReserved {
switch resource {
case v1.ResourceMemory:
m.retrySetMemoryReserve(qosConfigs, percentReserve)
}
}
}
//Finally write the configs to the corresponding cgroups
for _, config := range qosConfigs {
err := m.cgroupManager.Update(config)
if err != nil {
klog.ErrorS(err, "Failed to update QoS cgroup configuration")
return err
}
}
klog.V(4).InfoS("Updated QoS cgroup configuration")
return nil
}
func (m *qosContainerManagerImpl) setCPUCgroupConfig(configs map[v1.PodQOSClass]*CgroupConfig) error {
pods := m.activePods()
burstablePodCPURequest := int64(0)
for i := range pods {
pod := pods[i]
//Determine the pod's QoS class
qosClass := v1qos.GetPodQOS(pod)
//only Burstable pods matter here
if qosClass != v1.PodQOSBurstable {
// we only care about the burstable qos tier
continue
}
//accumulate the CPU requests
req, _ := resource.PodRequestsAndLimits(pod)
if request, found := req[v1.ResourceCPU]; found {
burstablePodCPURequest += request.MilliValue()
}
}
//BestEffort cpu.shares is always 2
// make sure best effort is always 2 shares
bestEffortCPUShares := uint64(MinShares)
configs[v1.PodQOSBestEffort].ResourceParameters.CpuShares = &bestEffortCPUShares
// set burstable shares based on current observe state
burstableCPUShares := MilliCPUToShares(burstablePodCPURequest)
configs[v1.PodQOSBurstable].ResourceParameters.CpuShares = &burstableCPUShares
return nil
}
// setMemoryReserve sums the memory limits of all pods in a QOS class,
// calculates QOS class memory limits, and set those limits in the
// CgroupConfig for each QOS class.
func (m *qosContainerManagerImpl) setMemoryReserve(configs map[v1.PodQOSClass]*CgroupConfig, percentReserve int64) {
qosMemoryRequests := m.getQoSMemoryRequests()
//getNodeAllocatable is the function GetNodeAllocatableAbsolute
resources := m.getNodeAllocatable()
allocatableResource, ok := resources[v1.ResourceMemory]
if !ok {
klog.V(2).InfoS("Allocatable memory value could not be determined, not setting QoS memory limits")
return
}
allocatable := allocatableResource.Value()
if allocatable == 0 {
klog.V(2).InfoS("Allocatable memory reported as 0, might be in standalone mode, not setting QoS memory limits")
return
}
for qos, limits := range qosMemoryRequests {
klog.V(2).InfoS("QoS pod memory limit", "qos", qos, "limits", limits, "percentReserve", percentReserve)
}
// Calculate QOS memory limits
burstableLimit := allocatable - (qosMemoryRequests[v1.PodQOSGuaranteed] * percentReserve / 100)
bestEffortLimit := burstableLimit - (qosMemoryRequests[v1.PodQOSBurstable] * percentReserve / 100)
configs[v1.PodQOSBurstable].ResourceParameters.Memory = &burstableLimit
configs[v1.PodQOSBestEffort].ResourceParameters.Memory = &bestEffortLimit
}
// getQoSMemoryRequests sums and returns the memory request of all pods for
// guaranteed and burstable qos classes.
func (m *qosContainerManagerImpl) getQoSMemoryRequests() map[v1.PodQOSClass]int64 {
qosMemoryRequests := map[v1.PodQOSClass]int64{
v1.PodQOSGuaranteed: 0,
v1.PodQOSBurstable: 0,
}
// Sum the pod limits for pods in each QOS class
pods := m.activePods()
for _, pod := range pods {
podMemoryRequest := int64(0)
qosClass := v1qos.GetPodQOS(pod)
if qosClass == v1.PodQOSBestEffort {
// limits are not set for Best Effort pods
continue
}
req, _ := resource.PodRequestsAndLimits(pod)
if request, found := req[v1.ResourceMemory]; found {
podMemoryRequest += request.Value()
}
qosMemoryRequests[qosClass] += podMemoryRequest
}
return qosMemoryRequests
}
// GetNodeAllocatableAbsolute returns the absolute value of Node Allocatable which is primarily useful for enforcement.
// Note that not all resources that are available on the node are included in the returned list of resources.
// Returns a ResourceList.
func (cm *containerManagerImpl) GetNodeAllocatableAbsolute() v1.ResourceList {
return cm.getNodeAllocatableAbsoluteImpl(cm.capacity)
}
func (cm *containerManagerImpl) getNodeAllocatableAbsoluteImpl(capacity v1.ResourceList) v1.ResourceList {
result := make(v1.ResourceList)
for k, v := range capacity {
value := v.DeepCopy()
if cm.NodeConfig.SystemReserved != nil {
value.Sub(cm.NodeConfig.SystemReserved[k])
}
if cm.NodeConfig.KubeReserved != nil {
value.Sub(cm.NodeConfig.KubeReserved[k])
}
if value.Sign() < 0 {
// Negative Allocatable resources don't make sense.
value.Set(0)
}
result[k] = value
}
return result
}
c. pod level
When a pod is created, syncPod first calls UpdateQOSCgroups to refresh the QoS-level cgroups and then calls EnsureExists to create the pod-level cgroup.
//Abridged from (kl *Kubelet) syncPod:
pcm := kl.containerManager.NewPodContainerManager()
if !pcm.Exists(pod) {
//update the Burstable/BestEffort QoS cgroups first
kl.containerManager.UpdateQOSCgroups()
//then create the pod's cgroup under kubepods
pcm.EnsureExists(pod)
}
Source: pkg/kubelet/cm/pod_container_manager.go
// EnsureExists takes a pod as argument and makes sure that
// pod cgroup exists if qos cgroup hierarchy flag is enabled.
// If the pod level container doesn't already exist it is created.
func (m *podContainerManagerImpl) EnsureExists(pod *v1.Pod) error {
podContainerName, _ := m.GetPodContainerName(pod)
// check if container already exist
alreadyExists := m.Exists(pod)
if !alreadyExists {
enforceMemoryQoS := false
if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.MemoryQoS) &&
libcontainercgroups.IsCgroup2UnifiedMode() {
enforceMemoryQoS = true
}
// Create the pod container
containerConfig := &CgroupConfig{
Name: podContainerName,
ResourceParameters: ResourceConfigForPod(pod, m.enforceCPULimits, m.cpuCFSQuotaPeriod, enforceMemoryQoS),
}
if m.podPidsLimit > 0 {
containerConfig.ResourceParameters.PidsLimit = &m.podPidsLimit
}
if enforceMemoryQoS {
klog.V(4).InfoS("MemoryQoS config for pod", "pod", klog.KObj(pod), "unified", containerConfig.ResourceParameters.Unified)
}
if err := m.cgroupManager.Create(containerConfig); err != nil {
return fmt.Errorf("failed to create container for %v : %v", podContainerName, err)
}
}
return nil
}
Source: pkg/kubelet/cm/helper_linux.go
// ResourceConfigForPod takes the input pod and outputs the cgroup resource config.
func ResourceConfigForPod(pod *v1.Pod, enforceCPULimits bool, cpuPeriod uint64, enforceMemoryQoS bool) *ResourceConfig {
// sum requests and limits.
reqs, limits := resource.PodRequestsAndLimits(pod)
cpuRequests := int64(0)
cpuLimits := int64(0)
memoryLimits := int64(0)
if request, found := reqs[v1.ResourceCPU]; found {
cpuRequests = request.MilliValue()
}
if limit, found := limits[v1.ResourceCPU]; found {
cpuLimits = limit.MilliValue()
}
if limit, found := limits[v1.ResourceMemory]; found {
memoryLimits = limit.Value()
}
// convert to CFS values
cpuShares := MilliCPUToShares(cpuRequests)
cpuQuota := MilliCPUToQuota(cpuLimits, int64(cpuPeriod))
// track if limits were applied for each resource.
memoryLimitsDeclared := true
cpuLimitsDeclared := true
// map hugepage pagesize (bytes) to limits (bytes)
hugePageLimits := map[int64]int64{}
for _, container := range pod.Spec.Containers {
if container.Resources.Limits.Cpu().IsZero() {
cpuLimitsDeclared = false
}
if container.Resources.Limits.Memory().IsZero() {
memoryLimitsDeclared = false
}
containerHugePageLimits := HugePageLimits(container.Resources.Requests)
for k, v := range containerHugePageLimits {
if value, exists := hugePageLimits[k]; exists {
hugePageLimits[k] = value + v
} else {
hugePageLimits[k] = v
}
}
}
for _, container := range pod.Spec.InitContainers {
if container.Resources.Limits.Cpu().IsZero() {
cpuLimitsDeclared = false
}
if container.Resources.Limits.Memory().IsZero() {
memoryLimitsDeclared = false
}
containerHugePageLimits := HugePageLimits(container.Resources.Requests)
for k, v := range containerHugePageLimits {
if value, exists := hugePageLimits[k]; !exists || v > value {
hugePageLimits[k] = v
}
}
}
// quota is not capped when cfs quota is disabled
if !enforceCPULimits {
cpuQuota = int64(-1)
}
// determine the qos class
qosClass := v1qos.GetPodQOS(pod)
// build the result
result := &ResourceConfig{}
if qosClass == v1.PodQOSGuaranteed {
result.CpuShares = &cpuShares
result.CpuQuota = &cpuQuota
result.CpuPeriod = &cpuPeriod
result.Memory = &memoryLimits
} else if qosClass == v1.PodQOSBurstable {
result.CpuShares = &cpuShares
if cpuLimitsDeclared {
result.CpuQuota = &cpuQuota
result.CpuPeriod = &cpuPeriod
}
if memoryLimitsDeclared {
result.Memory = &memoryLimits
}
} else {
shares := uint64(MinShares)
result.CpuShares = &shares
}
result.HugePageLimit = hugePageLimits
if enforceMemoryQoS {
memoryMin := int64(0)
if request, found := reqs[v1.ResourceMemory]; found {
memoryMin = request.Value()
}
if memoryMin > 0 {
result.Unified = map[string]string{
MemoryMin: strconv.FormatInt(memoryMin, 10),
}
}
}
return result
}
d. container level
Call path: startContainer -> generateContainerConfig -> applyPlatformSpecificContainerConfig -> generateLinuxContainerConfig
generateLinuxContainerConfig translates the requests and limits configured by the user into the container config, which is eventually handed to containerd to create the container and its container-level cgroup.
Source: pkg/kubelet/kuberuntime/kuberuntime_container_linux.go
// generateLinuxContainerConfig generates linux container config for kubelet runtime v1.
func (m *kubeGenericRuntimeManager) generateLinuxContainerConfig(container *v1.Container, pod *v1.Pod, uid *int64, username string, nsTarget *kubecontainer.ContainerID, enforceMemoryQoS bool) *runtimeapi.LinuxContainerConfig {
...
// set linux container resources
var cpuShares int64
cpuRequest := container.Resources.Requests.Cpu()
cpuLimit := container.Resources.Limits.Cpu()
memoryLimit := container.Resources.Limits.Memory().Value()
memoryRequest := container.Resources.Requests.Memory().Value()
oomScoreAdj := int64(qos.GetContainerOOMScoreAdjust(pod, container,
int64(m.machineInfo.MemoryCapacity)))
// If request is not specified, but limit is, we want request to default to limit.
// API server does this for new containers, but we repeat this logic in Kubelet
// for containers running on existing Kubernetes clusters.
if cpuRequest.IsZero() && !cpuLimit.IsZero() {
cpuShares = milliCPUToShares(cpuLimit.MilliValue())
} else {
// if cpuRequest.Amount is nil, then milliCPUToShares will return the minimal number
// of CPU shares.
cpuShares = milliCPUToShares(cpuRequest.MilliValue())
}
lc.Resources.CpuShares = cpuShares
if memoryLimit != 0 {
lc.Resources.MemoryLimitInBytes = memoryLimit
}
// Set OOM score of the container based on qos policy. Processes in lower-priority pods should
// be killed first if the system runs out of memory.
lc.Resources.OomScoreAdj = oomScoreAdj
if m.cpuCFSQuota {
// if cpuLimit.Amount is nil, then the appropriate default value is returned
// to allow full usage of cpu resource.
cpuPeriod := int64(quotaPeriod)
if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.CPUCFSQuotaPeriod) {
cpuPeriod = int64(m.cpuCFSQuotaPeriod.Duration / time.Microsecond)
}
cpuQuota := milliCPUToQuota(cpuLimit.MilliValue(), cpuPeriod)
lc.Resources.CpuQuota = cpuQuota
lc.Resources.CpuPeriod = cpuPeriod
}
...
return lc
}
References
https://github.com/kubernetes/design-proposals-archive/blob/main/node/node-allocatable.md
https://github.com/kubernetes/design-proposals-archive/blob/main/node/resource-qos.md
https://access.redhat.com/documentation/zh-cn/red_hat_enterprise_linux/7/html/resource_management_guide/sec-memory