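Table of contents

  • Scheduling Framework
  • Getting started
  • 1. Write a `KubeSchedulerConfiguration` YAML file
  • 2. Modify the kube-scheduler container configuration
  • Developing a new plugin
  • Coding
  • Updating the plugin configuration to use the NoOp plugin at the QueueSort stage
  • Testing the result
  • Summary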

Scheduling Framework

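Custom schedulers in Kubernetes are built on the scheduling framework. The scheduling framework requires custom scheduling logic to be implemented as plugins, which work much like callback functions.

Reference: https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/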

[Figure: the default scheduler in Kubernetes]


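As the figure above shows, scheduling roughly consists of two phases, filtering (predicates) and scoring (priorities). Once the most suitable node has been selected, the Pod is bound to that node.

Each stage has a predefined interface, and custom logic goes into an implementation of that interface. For example, the QueueSort stage (the first stage of scheduling, which orders the Pods waiting in the scheduling queue) requires an interface that compares two Pods:

Less(*framework.QueuedPodInfo, *framework.QueuedPodInfo) bool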

Getting started

The simplest approach is to build on an existing open-source project on GitHub.

  • GitHub repository: https://github.com/kubernetes-sigs/scheduler-plugins.git
  • Kubernetes version: v1.23.5

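Inserting custom logic into the default Kubernetes scheduling flow roughly takes two steps:

  1. Write a KubeSchedulerConfiguration file and mount it into the kube-scheduler container. KubeSchedulerConfiguration specifies which plugin is called back at which stage.
  2. Modify the kube-scheduler container configuration, usually /etc/kubernetes/manifests/kube-scheduler.yaml. Set the --config flag to the path of the KubeSchedulerConfiguration file inside the container. After the change is saved, the kubelet automatically restarts the scheduler container in the kube-system namespace so that the configuration takes effect.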

1. Write a KubeSchedulerConfiguration YAML file

  • /etc/kubernetes/sched-cc.yaml
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
leaderElection:
  # (Optional) Change true to false if you are not running a HA control-plane.
  leaderElect: true
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
profiles:
  - schedulerName: default-scheduler
    plugins:
      # Use only the Coscheduling plugin at the queueSort stage
      queueSort:
        enabled:
          - name: Coscheduling
        disabled:
          - name: "*"
      preFilter:
        enabled:
          - name: Coscheduling
      postFilter:
        enabled:
          - name: Coscheduling
      permit:
        enabled:
          - name: Coscheduling
      reserve:
        enabled:
          - name: Coscheduling
      postBind:
        enabled:
          - name: Coscheduling
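
The queueSort section pairs enabled with disabled: "*" because the scheduler accepts only one queue sort plugin, and the default one (PrioritySort) would otherwise stay active. The sketch below models that merge in a simplified way. The pluginSet type and the effective function are hypothetical names for illustration, not the scheduler's actual merge code.

```go
package main

import "fmt"

// pluginSet mirrors the enabled/disabled lists of one extension point
// in KubeSchedulerConfiguration (simplified, hypothetical stand-in type).
type pluginSet struct {
	Enabled  []string
	Disabled []string
}

// effective returns the plugins that run at an extension point:
// defaults not listed in Disabled ("*" disables every default), followed by Enabled.
func effective(defaults []string, custom pluginSet) []string {
	disabled := map[string]bool{}
	for _, name := range custom.Disabled {
		disabled[name] = true
	}
	seen := map[string]bool{}
	var out []string
	if !disabled["*"] {
		for _, name := range defaults {
			if !disabled[name] && !seen[name] {
				out = append(out, name)
				seen[name] = true
			}
		}
	}
	for _, name := range custom.Enabled {
		if !seen[name] {
			out = append(out, name)
			seen[name] = true
		}
	}
	return out
}

func main() {
	defaults := []string{"PrioritySort"}
	withStar := effective(defaults, pluginSet{Enabled: []string{"Coscheduling"}, Disabled: []string{"*"}})
	withoutStar := effective(defaults, pluginSet{Enabled: []string{"Coscheduling"}})
	fmt.Println(withStar)    // [Coscheduling]
	fmt.Println(withoutStar) // [PrioritySort Coscheduling]
}
```

In this model, leaving out the disabled entry yields two queue sort plugins, which is the situation the configuration above avoids.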

2. Modify the kube-scheduler container configuration

  • /etc/kubernetes/manifests/kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    # Modify the kube-scheduler startup flags here
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    # Path of the KubeSchedulerConfiguration file inside the container
    - --config=/etc/kubernetes/sched-cc.yaml
    - -v=9
    # Tag of the image built manually after implementing the custom logic
    image: localhost:5000/scheduler-plugins/kube-scheduler:latest
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes
      name: kubeconfig
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  # Mount the directory that contains the custom KubeSchedulerConfiguration file
  - hostPath:
      path: /etc/kubernetes/
      type: Directory
    name: kubeconfig
status: {}

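Developing a new plugin

Coding

Besides the plugins that already ship with the repository, you can develop your own by following the same conventions.

Register your plugin in cmd/scheduler/main.go, implement the custom logic using the other plugins as a reference, and rebuild the image. The custom logic then becomes part of the default Kubernetes scheduling flow.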

func main() {
	rand.Seed(time.Now().UnixNano())

	// Register custom plugins to the scheduler framework.
	// Later they can consist of scheduler profile(s) and hence
	// used by various kinds of workloads.
	command := app.NewSchedulerCommand(
		app.WithPlugin(capacityscheduling.Name, capacityscheduling.New),
		app.WithPlugin(coscheduling.Name, coscheduling.New),
		app.WithPlugin(loadvariationriskbalancing.Name, loadvariationriskbalancing.New),
		app.WithPlugin(noderesources.AllocatableName, noderesources.NewAllocatable),
		app.WithPlugin(noderesourcetopology.Name, noderesourcetopology.New),
		app.WithPlugin(preemptiontoleration.Name, preemptiontoleration.New),
		app.WithPlugin(targetloadpacking.Name, targetloadpacking.New),

		// Added line: register the new plugin, which provides Name and New
		app.WithPlugin(noop.Name, noop.New),
	)

	// TODO: once we switch everything over to Cobra commands, we can go back to calling
	// utilflag.InitFlags() (by removing its pflag.Parse() call). For now, we have to set the
	// normalize func and add the go flag set by hand.
	// utilflag.InitFlags()
	logs.InitLogs()
	defer logs.FlushLogs()

	if err := command.Execute(); err != nil {
		os.Exit(1)
	}
}
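
Registration ties a plugin name to a factory function, and the name used in the configuration file has to match the registered name exactly. The sketch below illustrates that idea with a small registry. The plugin, factory and registry types are hypothetical stand-ins, not the scheduler's real types.

```go
package main

import "fmt"

// plugin and factory are hypothetical stand-ins for the framework's
// plugin interface and plugin factory type.
type plugin interface{ Name() string }
type factory func() (plugin, error)

// registry maps a plugin name to the factory that builds it.
type registry map[string]factory

// register adds a factory under name; duplicate names are rejected.
func (r registry) register(name string, f factory) error {
	if _, ok := r[name]; ok {
		return fmt.Errorf("plugin %q is already registered", name)
	}
	r[name] = f
	return nil
}

// build instantiates the plugin that a configuration refers to by name.
func (r registry) build(name string) (plugin, error) {
	f, ok := r[name]
	if !ok {
		return nil, fmt.Errorf("plugin %q is not registered", name)
	}
	return f()
}

type noOp struct{}

func (noOp) Name() string { return "NoOp" }

func main() {
	r := registry{}
	if err := r.register("NoOp", func() (plugin, error) { return noOp{}, nil }); err != nil {
		panic(err)
	}
	if p, err := r.build("NoOp"); err == nil {
		fmt.Println("built", p.Name())
	}
	// A name that differs only in case is a different, unknown plugin.
	if _, err := r.build("Noop"); err != nil {
		fmt.Println(err)
	}
}
```

The practical consequence is that a typo in the plugin name in the configuration file should surface as a startup error rather than a silently ignored setting.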

Custom scheduling logic:

In short, during an emergency, Pods labeled emergency: red must be scheduled first. Pods without that label are ordered by priority.

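package noop

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	corev1helpers "k8s.io/component-helpers/scheduling/corev1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Name is the plugin name referenced in KubeSchedulerConfiguration.
const Name = "NoOp"

// NoOp is a QueueSort plugin that moves Pods labeled emergency=red to the front of the queue.
type NoOp struct{}

// isEmergency reports whether the Pod carries the label emergency=red.
func isEmergency(pod *v1.Pod) bool {
	return pod.Labels["emergency"] == "red"
}

// Less puts emergency Pods first; Pods of the same class are ordered by priority.
func (pl *NoOp) Less(info, info2 *framework.QueuedPodInfo) bool {
	e1, e2 := isEmergency(info.Pod), isEmergency(info2.Pod)
	if e1 != e2 {
		return e1
	}
	return corev1helpers.PodPriority(info.Pod) > corev1helpers.PodPriority(info2.Pod)
}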

var _ framework.QueueSortPlugin = &NoOp{}

// Name returns name of the plugin.
func (pl *NoOp) Name() string {
	return Name
}

// New initializes a new plugin and returns it.
func New(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
	return &NoOp{}, nil
}
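
One property worth checking for any queue comparator is asymmetry: Less(a, b) and Less(b, a) should never both be true, otherwise the queue order is ill-defined (for example when two emergency Pods are compared). The following sketch checks this for an emergency-first rule. The pod type is a hypothetical stand-in and the code does not depend on the scheduler packages.

```go
package main

import "fmt"

// pod is a simplified, hypothetical stand-in for the framework's queued pod type.
type pod struct {
	Name     string
	Labels   map[string]string
	Priority int32
}

// isEmergency reports whether the pod carries the label emergency=red.
func isEmergency(p pod) bool { return p.Labels["emergency"] == "red" }

// less puts emergency pods first and orders pods of the same class by priority.
func less(a, b pod) bool {
	ea, eb := isEmergency(a), isEmergency(b)
	if ea != eb {
		return ea
	}
	return a.Priority > b.Priority
}

// asymmetric reports whether cmp never claims both a<b and b<a for any pair.
func asymmetric(pods []pod, cmp func(a, b pod) bool) bool {
	for _, a := range pods {
		for _, b := range pods {
			if cmp(a, b) && cmp(b, a) {
				return false
			}
		}
	}
	return true
}

func main() {
	red := map[string]string{"emergency": "red"}
	pods := []pod{
		{Name: "red-1", Labels: red, Priority: 0},
		{Name: "red-2", Labels: red, Priority: 10},
		{Name: "pause", Priority: 100},
	}
	fmt.Println(asymmetric(pods, less))                         // true
	fmt.Println(less(pods[0], pods[2]), less(pods[2], pods[0])) // true false
}
```

A comparator that returns true whenever its first argument is an emergency Pod, without looking at the second, would fail this check for the pair red-1 and red-2.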

Updating the plugin configuration to use the NoOp plugin at the QueueSort stage

Change /etc/kubernetes/sched-cc.yaml to:

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
leaderElection:
  # (Optional) Change true to false if you are not running a HA control-plane.
  leaderElect: true
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
profiles:
  - schedulerName: default-scheduler
    plugins:
      # Use only the NoOp plugin at the queueSort stage
      queueSort:
        enabled:
          - name: NoOp
        disabled:
          - name: "*"

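After rebuilding the image with make local-image, restart the scheduler container in kube-system (simply kill it) so that the plugin configuration takes effect.

Testing the result

Apply the following file with kubectl apply -f. Create the non-emergency Deployment first, then the emergency one.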

# deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
spec:
  replicas: 4
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
        pod-group.scheduling.sigs.k8s.io: pg1
    spec:
      containers:
        - name: nginx
          image: nginx

---

# deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: red
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
        emergency: red
    spec:
      containers:
        - name: nginx
          image: nginx

The result: although the emergency Pod was applied later, it is scheduled ahead of the non-emergency Pods.

[root@localhost kubernetes]# kubectl get pods -w
NAME                    READY   STATUS    RESTARTS   AGE
pause-c9db6b47f-l2xg9   0/1     Pending   0          0s
red-77bffc7d58-zsvjc    0/1     Pending   0          0s
pause-c9db6b47f-k7hgz   0/1     Pending   0          0s
pause-c9db6b47f-l2xg9   0/1     Pending   0          0s
pause-c9db6b47f-gp5sw   0/1     Pending   0          0s
red-77bffc7d58-zsvjc    0/1     Pending   0          0s
pause-c9db6b47f-k7hgz   0/1     Pending   0          0s
pause-c9db6b47f-hhs7d   0/1     Pending   0          0s
pause-c9db6b47f-gp5sw   0/1     Pending   0          0s
pause-c9db6b47f-hhs7d   0/1     Pending   0          0s
pause-c9db6b47f-l2xg9   0/1     ContainerCreating   0          0s
red-77bffc7d58-zsvjc    0/1     ContainerCreating   0          0s
pause-c9db6b47f-gp5sw   0/1     ContainerCreating   0          0s
pause-c9db6b47f-k7hgz   0/1     ContainerCreating   0          0s
pause-c9db6b47f-hhs7d   0/1     ContainerCreating   0          0s
# The emergency pod starts first
red-77bffc7d58-zsvjc    1/1     Running             0          4s
pause-c9db6b47f-hhs7d   1/1     Running             0          6s
pause-c9db6b47f-k7hgz   1/1     Running             0          8s
pause-c9db6b47f-l2xg9   1/1     Running             0          9s
pause-c9db6b47f-gp5sw   1/1     Running             0          10s

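Summary

This article is only a proof of concept for extending the scheduler. The focus is on understanding how custom code gets inserted into the default scheduling flow.

Once you know how to develop a simple plugin, you can implement more complex logic, for example:

  • Assigning Pods to nodes based on Prometheus monitoring data to eliminate load imbalance across the cluster
  • Starting 100 Pods at the same time for a Spark computation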