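Table of contents
- Scheduling Framework
- How to get started?
- 1. Write a `KubeSchedulerConfiguration` YAML file
- 2. Modify the kube-scheduler container configuration
- Developing a new plugin
- Coding
- Update the plugin configuration to use the NoOp plugin in the QueueSort stage
- Testing the result
- Summary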
Scheduling Framework
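Custom schedulers in Kubernetes are built on the scheduling framework. The scheduling framework requires custom scheduling logic to be implemented as plugins, which behave much like callback functions.
Reference: https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/
As the scheduling framework diagram shows, the scheduling process is roughly divided into two phases, filtering and scoring. Once the most suitable node has been selected, the Pod is bound to that node.
You can implement the interface predefined for each stage and put your custom logic inside it. For example, the QueueSort stage (the first stage of scheduling, which sorts the Pods in the scheduling queue by scheduling priority) requires implementing an interface that compares the priority of two Pods.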
Less(*framework.QueuedPodInfo, *framework.QueuedPodInfo) bool
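To make the role of Less more concrete, here is a minimal, self-contained sketch of a heap-backed queue whose pop order is decided entirely by a Less-style callback. The types queuedPod, lessFunc and podHeap and the function popOrder are hypothetical stand-ins invented for this illustration; they are not the scheduler's real types, and the real queue is considerably more involved.

```go
package main

import (
	"container/heap"
	"fmt"
)

// queuedPod is a hypothetical, simplified stand-in for framework.QueuedPodInfo.
type queuedPod struct {
	Name     string
	Priority int32
}

// lessFunc has the same shape as the QueueSort Less callback:
// it reports whether a should be scheduled before b.
type lessFunc func(a, b *queuedPod) bool

// podHeap implements heap.Interface and delegates ordering to less.
type podHeap struct {
	pods []*queuedPod
	less lessFunc
}

func (h *podHeap) Len() int            { return len(h.pods) }
func (h *podHeap) Less(i, j int) bool  { return h.less(h.pods[i], h.pods[j]) }
func (h *podHeap) Swap(i, j int)       { h.pods[i], h.pods[j] = h.pods[j], h.pods[i] }
func (h *podHeap) Push(x interface{})  { h.pods = append(h.pods, x.(*queuedPod)) }
func (h *podHeap) Pop() interface{} {
	n := len(h.pods)
	p := h.pods[n-1]
	h.pods = h.pods[:n-1]
	return p
}

// byPriority puts a before b when a has the higher priority.
func byPriority(a, b *queuedPod) bool { return a.Priority > b.Priority }

// popOrder pushes every Pod into the heap and returns the names in pop order.
func popOrder(pods []*queuedPod, less lessFunc) []string {
	h := &podHeap{less: less}
	for _, p := range pods {
		heap.Push(h, p)
	}
	names := make([]string, 0, len(pods))
	for h.Len() > 0 {
		names = append(names, heap.Pop(h).(*queuedPod).Name)
	}
	return names
}

func main() {
	pods := []*queuedPod{{"low", 0}, {"high", 100}, {"mid", 10}}
	fmt.Println(popOrder(pods, byPriority)) // prints [high mid low]
}
```

Swapping byPriority for another comparator changes the pop order without touching the queue itself, which is the idea behind making QueueSort a plugin.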
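How to get started?
It is enough to do secondary development on top of an existing GitHub project.
- GitHub repository: https://github.com/kubernetes-sigs/scheduler-plugins.git
- Kubernetes version: v1.23.5
The rough process for inserting custom logic into the default Kubernetes scheduling flow is:
- Write a KubeSchedulerConfiguration file and mount it into the kube-scheduler container. KubeSchedulerConfiguration specifies which plugin is called back at which stage.
- Modify the kube-scheduler container configuration, usually /etc/kubernetes/manifests/kube-scheduler.yaml. Set the --config argument to the path of the KubeSchedulerConfiguration file inside the container. Because this manifest defines a static Pod, the kubelet automatically restarts the scheduler container in the kube-system namespace after the file is saved, and the new configuration takes effect.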
1. Write a KubeSchedulerConfiguration YAML file
/etc/kubernetes/sched-cc.yaml
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
leaderElection:
  # (Optional) Change true to false if you are not running a HA control-plane.
  leaderElect: true
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
profiles:
- schedulerName: default-scheduler
  plugins:
    # The queueSort stage uses only the Coscheduling plugin
    queueSort:
      enabled:
      - name: Coscheduling
      disabled:
      - name: "*"
    preFilter:
      enabled:
      - name: Coscheduling
    postFilter:
      enabled:
      - name: Coscheduling
    permit:
      enabled:
      - name: Coscheduling
    reserve:
      enabled:
      - name: Coscheduling
    postBind:
      enabled:
      - name: Coscheduling
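The enabled and disabled lists can be read as a small merge rule per extension point. The sketch below is a deliberately simplified model of that rule, not the scheduler's actual merging code: pluginSet and effective are hypothetical names, and the default plugin lists passed in are only an illustrative subset.

```go
package main

import "fmt"

// pluginSet mirrors the enabled/disabled lists of one extension point
// in a KubeSchedulerConfiguration profile (simplified, hypothetical type).
type pluginSet struct {
	Enabled  []string
	Disabled []string
}

// effective returns the plugins that would run at an extension point in this
// simplified model: default plugins that are not disabled, followed by the
// explicitly enabled ones. A disabled entry of "*" drops every default.
func effective(defaults []string, custom pluginSet) []string {
	disabled := make(map[string]bool, len(custom.Disabled))
	for _, name := range custom.Disabled {
		disabled[name] = true
	}
	seen := map[string]bool{}
	var out []string
	if !disabled["*"] {
		for _, name := range defaults {
			if !disabled[name] && !seen[name] {
				out = append(out, name)
				seen[name] = true
			}
		}
	}
	for _, name := range custom.Enabled {
		if !seen[name] {
			out = append(out, name)
			seen[name] = true
		}
	}
	return out
}

func main() {
	// queueSort: the default PrioritySort is removed by "*", leaving Coscheduling.
	queueSort := effective(
		[]string{"PrioritySort"},
		pluginSet{Enabled: []string{"Coscheduling"}, Disabled: []string{"*"}},
	)
	fmt.Println(queueSort) // prints [Coscheduling]

	// preFilter: nothing is disabled, so Coscheduling is added to the defaults.
	// The two default names here are only an illustrative subset.
	preFilter := effective(
		[]string{"NodeResourcesFit", "NodePorts"},
		pluginSet{Enabled: []string{"Coscheduling"}},
	)
	fmt.Println(preFilter) // prints [NodeResourcesFit NodePorts Coscheduling]
}
```

This also suggests why the queueSort entry carries a disabled "*" while the other stages do not: only one queue sort plugin can be active at a time, so the default one has to be switched off before another is enabled.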
2. Modify the kube-scheduler container configuration
/etc/kubernetes/manifests/kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    # Modify the kube-scheduler startup arguments here
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    # Path of the KubeSchedulerConfiguration file inside the container
    - --config=/etc/kubernetes/sched-cc.yaml
    - -v=9
    # Tag of the image built by hand after developing the custom logic
    image: localhost:5000/scheduler-plugins/kube-scheduler:latest
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes
      name: kubeconfig
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  # Mount the directory that holds the custom KubeSchedulerConfiguration file
  - hostPath:
      path: /etc/kubernetes/
      type: Directory
    name: kubeconfig
status: {}
Developing a new plugin
Coding
Besides the plugins that already ship with the repository, you can develop your own plugin by following the same conventions.
Register your plugin in cmd/scheduler/main.go of the project and implement the custom logic by following the code of the other plugins. After the image is rebuilt, the custom logic is inserted into the default k8s scheduling flow.
func main() {
	rand.Seed(time.Now().UnixNano())
	// Register custom plugins to the scheduler framework.
	// Later they can consist of scheduler profile(s) and hence
	// used by various kinds of workloads.
	command := app.NewSchedulerCommand(
		app.WithPlugin(capacityscheduling.Name, capacityscheduling.New),
		app.WithPlugin(coscheduling.Name, coscheduling.New),
		app.WithPlugin(loadvariationriskbalancing.Name, loadvariationriskbalancing.New),
		app.WithPlugin(noderesources.AllocatableName, noderesources.NewAllocatable),
		app.WithPlugin(noderesourcetopology.Name, noderesourcetopology.New),
		app.WithPlugin(preemptiontoleration.Name, preemptiontoleration.New),
		app.WithPlugin(targetloadpacking.Name, targetloadpacking.New),
		// Newly added line: a plugin package that provides Name and New.
		// The trailing comma is required because the closing parenthesis is on its own line.
		app.WithPlugin(noop.Name, noop.New),
	)
	// TODO: once we switch everything over to Cobra commands, we can go back to calling
	// utilflag.InitFlags() (by removing its pflag.Parse() call). For now, we have to set the
	// normalize func and add the go flag set by hand.
	// utilflag.InitFlags()
	logs.InitLogs()
	defer logs.FlushLogs()
	if err := command.Execute(); err != nil {
		os.Exit(1)
	}
}
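Conceptually, registering a plugin associates a name with a factory function, so that a profile in the configuration file can later refer to the plugin by that name. The following sketch models this idea with a plain map. The names plugin, factory, registry, register, noOp and newNoOp are hypothetical simplifications made up for this example; the real framework's factory signature takes additional arguments.

```go
package main

import "fmt"

// plugin is a hypothetical, minimal version of a scheduler plugin.
type plugin interface {
	Name() string
}

// factory builds a plugin instance; the real factory takes more arguments.
type factory func() (plugin, error)

// registry maps a plugin name to the factory that creates it.
type registry map[string]factory

// register adds a factory under name and rejects duplicate names.
func (r registry) register(name string, f factory) error {
	if _, ok := r[name]; ok {
		return fmt.Errorf("a plugin named %q already exists", name)
	}
	r[name] = f
	return nil
}

// noOp is a stand-in plugin used only in this sketch.
type noOp struct{}

func (noOp) Name() string { return "NoOp" }

func newNoOp() (plugin, error) { return noOp{}, nil }

func main() {
	r := registry{}
	if err := r.register("NoOp", newNoOp); err != nil {
		panic(err)
	}

	// A profile that names "NoOp" can now be resolved to a plugin instance.
	p, err := r["NoOp"]()
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Name()) // prints NoOp

	// Registering the same name twice is rejected.
	fmt.Println(r.register("NoOp", newNoOp) != nil) // prints true
}
```

In this model, a name that appears in the configuration file but was never registered has no factory to call, which is why the registration line and the plugin name in the YAML have to match.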
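Custom scheduling logic:
In short, in an emergency, Pods labeled emergency: red must be scheduled first. Pods that are not emergency Pods are ordered by priority.
package noop

import (
	"k8s.io/apimachinery/pkg/runtime"
	corev1helpers "k8s.io/component-helpers/scheduling/corev1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const Name = "NoOp"

// NoOp is a QueueSort plugin that puts emergency Pods first
// and otherwise orders Pods by priority.
type NoOp struct{}

func (pl *NoOp) Less(info *framework.QueuedPodInfo, info2 *framework.QueuedPodInfo) bool {
	// Check the label on both Pods; looking only at the first one would make
	// Less(a, b) and Less(b, a) both true when both Pods are emergencies.
	e1 := info.Pod.Labels["emergency"] == "red"
	e2 := info2.Pod.Labels["emergency"] == "red"
	// if exactly one Pod is code red, schedule it first
	if e1 != e2 {
		return e1
	}
	// otherwise fall back to Pod priority
	p1 := corev1helpers.PodPriority(info.Pod)
	p2 := corev1helpers.PodPriority(info2.Pod)
	return p1 > p2
}

var _ framework.QueueSortPlugin = &NoOp{}

// Name returns name of the plugin.
func (pl *NoOp) Name() string {
	return Name
}

// New initializes a new plugin and returns it.
func New(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
	return &NoOp{}, nil
}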
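Because a QueueSort comparator is used to order a queue, it is worth checking outside the cluster that Less(a, b) and Less(b, a) are never both true. The sketch below does this with hypothetical stand-in types (pod, isEmergency, less and sortedNames are names invented for this example) and an emergency-first comparator that inspects the label on both Pods. It runs without any Kubernetes dependency.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical stand-in carrying only what the comparator needs.
type pod struct {
	Name     string
	Labels   map[string]string
	Priority int32
}

// isEmergency reports whether the Pod carries the label emergency=red.
// Reading from a nil map is safe in Go and yields the empty string.
func isEmergency(p pod) bool { return p.Labels["emergency"] == "red" }

// less puts emergency Pods first and otherwise orders by priority.
func less(a, b pod) bool {
	ea, eb := isEmergency(a), isEmergency(b)
	if ea != eb {
		return ea
	}
	return a.Priority > b.Priority
}

// sortedNames returns the Pod names in the order the comparator produces,
// leaving the input slice untouched.
func sortedNames(pods []pod) []string {
	out := append([]pod(nil), pods...)
	sort.SliceStable(out, func(i, j int) bool { return less(out[i], out[j]) })
	names := make([]string, len(out))
	for i, p := range out {
		names[i] = p.Name
	}
	return names
}

func main() {
	red := map[string]string{"emergency": "red"}
	pods := []pod{
		{Name: "pause-1", Priority: 0},
		{Name: "pause-2", Priority: 100},
		{Name: "red-1", Labels: red, Priority: 0},
		{Name: "red-2", Labels: red, Priority: 5},
	}
	fmt.Println(sortedNames(pods)) // prints [red-2 red-1 pause-2 pause-1]

	// Asymmetry check: no pair may sort before each other in both directions.
	for _, a := range pods {
		for _, b := range pods {
			if less(a, b) && less(b, a) {
				panic("comparator is not asymmetric")
			}
		}
	}
}
```

Note that pause-2 has the highest priority in this example yet still sorts after both emergency Pods, which is the behaviour described above.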
Update the plugin configuration to use the NoOp plugin in the QueueSort stage
Change /etc/kubernetes/sched-cc.yaml to:
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
leaderElection:
  # (Optional) Change true to false if you are not running a HA control-plane.
  leaderElect: true
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
profiles:
- schedulerName: default-scheduler
  plugins:
    # The queueSort stage uses only the NoOp plugin
    queueSort:
      enabled:
      - name: NoOp
      disabled:
      - name: "*"
Rebuild the image with make local-image, then restart the scheduler container in kube-system (simply kill it) so that the plugin configuration takes effect.
Testing the result
Apply the following file with kubectl apply -f. It creates a non-emergency workload first and then an emergency one.
# deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
spec:
  replicas: 4
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
        pod-group.scheduling.sigs.k8s.io: pg1
    spec:
      containers:
      - name: nginx
        image: nginx
---
# deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: red
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
        emergency: red
    spec:
      containers:
      - name: nginx
        image: nginx
The scheduling result shows that although the emergency Pod was applied later, it was scheduled ahead of the non-emergency Pods.
[root@localhost kubernetes]# kubectl get pods -w
NAME                    READY   STATUS              RESTARTS   AGE
pause-c9db6b47f-l2xg9   0/1     Pending             0          0s
red-77bffc7d58-zsvjc    0/1     Pending             0          0s
pause-c9db6b47f-k7hgz   0/1     Pending             0          0s
pause-c9db6b47f-l2xg9   0/1     Pending             0          0s
pause-c9db6b47f-gp5sw   0/1     Pending             0          0s
red-77bffc7d58-zsvjc    0/1     Pending             0          0s
pause-c9db6b47f-k7hgz   0/1     Pending             0          0s
pause-c9db6b47f-hhs7d   0/1     Pending             0          0s
pause-c9db6b47f-gp5sw   0/1     Pending             0          0s
pause-c9db6b47f-hhs7d   0/1     Pending             0          0s
pause-c9db6b47f-l2xg9   0/1     ContainerCreating   0          0s
red-77bffc7d58-zsvjc    0/1     ContainerCreating   0          0s
pause-c9db6b47f-gp5sw   0/1     ContainerCreating   0          0s
pause-c9db6b47f-k7hgz   0/1     ContainerCreating   0          0s
pause-c9db6b47f-hhs7d   0/1     ContainerCreating   0          0s
# The emergency Pod started first
red-77bffc7d58-zsvjc    1/1     Running             0          4s
pause-c9db6b47f-hhs7d   1/1     Running             0          6s
pause-c9db6b47f-k7hgz   1/1     Running             0          8s
pause-c9db6b47f-l2xg9   1/1     Running             0          9s
pause-c9db6b47f-gp5sw   1/1     Running             0          10s
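Summary
This article is only a proof of concept for secondary development. Its main goal is to show how custom code can be inserted into the default scheduling flow.
Once you know how to develop a simple plugin, you can implement more complex logic, for example:
- Assigning Pods to nodes based on Prometheus monitoring data to even out load skew across the cluster
- Starting 100 Pods at the same time for a Spark computation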