How the Kubernetes Scheduler Works

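The scheduler's job is to bind each pending Pod to a suitable Node in the cluster, according to a specific scheduling algorithm and set of scheduling policies, and to have the binding persisted in etcd through the API Server. The kubelet on the target node watches the API Server for the Pod binding events produced by the Kubernetes Scheduler, fetches the corresponding Pod manifest, and pulls the container images.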


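Scheduling involves three inputs:
  1. The list of pending Pods
  2. The list of available Nodes
  3. The scheduling algorithm and scheduling policies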

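The default scheduling flow has two steps:
  1. Predicate (filtering) phase: iterate over all target Nodes and keep the candidates that satisfy the Pod's requirements. Kubernetes ships with several built-in predicate policies.
  2. Priority (scoring) phase: the scheduling algorithm is greedy. It applies the priority policies to compute a score for each candidate node and picks the node with the highest score.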
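The filter-then-score flow can be sketched in a few lines of Go. This is only an illustration under simplifying assumptions: the `Node`, `Pod`, `Predicate`, and `Priority` types and the `schedule`, `fitsResources`, and `mostFreeCPU` functions are hypothetical names, not the kube-scheduler API, and ties go to the first node seen.

```go
package main

import "fmt"

// Node and Pod are hypothetical, heavily simplified stand-ins for the real API objects.
type Node struct {
	Name    string
	FreeCPU int64 // millicores not yet requested
	FreeMem int64 // MiB not yet requested
}

type Pod struct {
	Name string
	CPU  int64
	Mem  int64
}

// Predicate reports whether the pod may run on the node.
type Predicate func(pod Pod, node Node) bool

// Priority scores a node for the pod; higher is better.
type Priority func(pod Pod, node Node) int64

// fitsResources is an example predicate: the node must have enough free CPU and memory.
func fitsResources(pod Pod, node Node) bool {
	return pod.CPU <= node.FreeCPU && pod.Mem <= node.FreeMem
}

// mostFreeCPU is an example priority: prefer the node with the most CPU left after placement.
func mostFreeCPU(pod Pod, node Node) int64 {
	return node.FreeCPU - pod.CPU
}

// passesAll returns true only if every predicate accepts the node.
func passesAll(pod Pod, node Node, preds []Predicate) bool {
	for _, pred := range preds {
		if !pred(pod, node) {
			return false
		}
	}
	return true
}

// schedule filters nodes with the predicates, sums the priority scores of the
// survivors, and greedily returns the name of the highest-scoring node.
func schedule(pod Pod, nodes []Node, preds []Predicate, prios []Priority) (string, bool) {
	best, bestScore, found := "", int64(0), false
	for _, node := range nodes {
		if !passesAll(pod, node, preds) {
			continue
		}
		var score int64
		for _, prio := range prios {
			score += prio(pod, node)
		}
		if !found || score > bestScore {
			best, bestScore, found = node.Name, score, true
		}
	}
	return best, found
}

func main() {
	nodes := []Node{{"node-a", 500, 1024}, {"node-b", 2000, 4096}, {"node-c", 4000, 8192}}
	pod := Pod{"web", 1000, 2048}
	name, ok := schedule(pod, nodes, []Predicate{fitsResources}, []Priority{mostFreeCPU})
	fmt.Println(name, ok) // prints: node-c true
}
```

Here node-a is filtered out because it lacks CPU, and node-c wins over node-b on score. The decision is greedy in the sense that each Pod is placed on the best node at that moment, with no attempt to optimize placement across Pods.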




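Common predicate policies include:
(1) NoDiskConflict: checks whether the Pod's volumes conflict with volumes already in use on the node.
(2) PodFitsResources: checks whether the node can satisfy the Pod's resource requests. This covers not only cpu and memory but any resource the pod requests.
(3) PodSelectorMatches: checks whether the node matches the nodeSelector the pod uses to choose a node.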
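The three predicates above might look roughly like this in Go. The `Pod` and `Node` types, their fields, and the function names are assumptions made for this sketch; the real predicates operate on the full API objects and handle many more cases (for example, volumes that may be shared read-only).

```go
package main

import "fmt"

// Pod and Node are hypothetical simplified types for illustration only.
type Pod struct {
	Requests     map[string]int64  // resource name -> requested amount
	NodeSelector map[string]string // labels the target node must carry
	Volumes      []string          // IDs of volumes assumed to be exclusive
}

type Node struct {
	Allocatable map[string]int64  // resource name -> total allocatable amount
	Requested   map[string]int64  // resource name -> amount already requested
	Labels      map[string]string // node labels
	Pods        []Pod             // pods already placed on the node
}

// noDiskConflict fails if the pod wants a volume already used by a pod on the node.
func noDiskConflict(pod Pod, node Node) bool {
	inUse := map[string]bool{}
	for _, existing := range node.Pods {
		for _, vol := range existing.Volumes {
			inUse[vol] = true
		}
	}
	for _, vol := range pod.Volumes {
		if inUse[vol] {
			return false
		}
	}
	return true
}

// podFitsResources checks every requested resource, not just cpu and memory.
func podFitsResources(pod Pod, node Node) bool {
	for name, want := range pod.Requests {
		if node.Requested[name]+want > node.Allocatable[name] {
			return false
		}
	}
	return true
}

// podSelectorMatches requires every nodeSelector entry to be present on the node.
func podSelectorMatches(pod Pod, node Node) bool {
	for key, want := range pod.NodeSelector {
		if got, ok := node.Labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	node := Node{
		Allocatable: map[string]int64{"cpu": 4000, "memory": 8192, "example.com/gpu": 1},
		Requested:   map[string]int64{"cpu": 3000, "memory": 2048},
		Labels:      map[string]string{"disktype": "ssd"},
		Pods:        []Pod{{Volumes: []string{"vol-1"}}},
	}
	pod := Pod{
		Requests:     map[string]int64{"cpu": 500, "example.com/gpu": 1},
		NodeSelector: map[string]string{"disktype": "ssd"},
		Volumes:      []string{"vol-2"},
	}
	fmt.Println(noDiskConflict(pod, node), podFitsResources(pod, node), podSelectorMatches(pod, node))
}
```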



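Common priority policies include:
  1. LeastRequestedPriority: selects the candidate node with the lowest resource consumption.
  2. CalculateNodeLabelPriority: decides whether to prefer a candidate node based on whether a listed label is present on it. If the node carries a label from the priority policy's label list, score=10; otherwise score=0.
  3. BalancedResourceAllocation: selects the candidate node whose resource utilization is most balanced across resources. Only cpu and memory are considered.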
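To make the three scoring policies concrete, here is a simplified sketch on a 0 to 10 scale. The formulas are an approximation of the classic priority functions as commonly described, and the function names and signatures are assumptions for illustration; the real implementations differ in detail.

```go
package main

import (
	"fmt"
	"math"
)

// leastRequested scores one resource: the larger the unused share, the higher the score.
func leastRequested(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * 10 / capacity
}

// leastRequestedScore averages the per-resource scores for cpu and memory.
func leastRequestedScore(reqCPU, capCPU, reqMem, capMem int64) int64 {
	return (leastRequested(reqCPU, capCPU) + leastRequested(reqMem, capMem)) / 2
}

// balancedScore is highest when the cpu and memory utilization fractions are equal.
func balancedScore(reqCPU, capCPU, reqMem, capMem int64) int64 {
	if capCPU == 0 || capMem == 0 {
		return 0
	}
	cpuFrac := float64(reqCPU) / float64(capCPU)
	memFrac := float64(reqMem) / float64(capMem)
	if cpuFrac >= 1 || memFrac >= 1 {
		return 0 // a fully consumed resource makes the node unattractive
	}
	return int64(10 - math.Abs(cpuFrac-memFrac)*10)
}

// nodeLabelScore returns 10 if the node carries the listed label, otherwise 0.
func nodeLabelScore(nodeLabels map[string]string, label string) int64 {
	if _, ok := nodeLabels[label]; ok {
		return 10
	}
	return 0
}

func main() {
	// A node with 4000m cpu and 8000 MiB memory, of which 1000m and 4000 MiB are requested.
	fmt.Println(leastRequestedScore(1000, 4000, 4000, 8000)) // prints: 6
	fmt.Println(balancedScore(1000, 4000, 4000, 8000))       // prints: 7
	fmt.Println(nodeLabelScore(map[string]string{"disktype": "ssd"}, "disktype")) // prints: 10
}
```

In the example, cpu scores 7 and memory scores 5 under least-requested, which averages to 6. The utilization fractions are 0.25 and 0.5, so the balance score is 10 - 2.5 = 7.5, truncated to 7.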


  1. Start with the Scheduler data structure:
type Scheduler struct {
	// It is expected that changes made via SchedulerCache will be observed
	// by NodeLister and Algorithm.
	SchedulerCache internalcache.Cache

	Algorithm core.ScheduleAlgorithm

	// NextPod should be a function that blocks until the next pod
	// is available. We don't use a channel for this, because scheduling
	// a pod may take some amount of time and we don't want pods to get
	// stale while they sit in a channel.
	NextPod func() *framework.QueuedPodInfo

	// Error is called if there is an error. It is passed the pod in
	// question, and the error
	Error func(*framework.QueuedPodInfo, error)

	// Close this to shut down the scheduler.
	StopEverything <-chan struct{}

	// SchedulingQueue holds pods to be scheduled
	SchedulingQueue internalqueue.SchedulingQueue

	// Profiles are the scheduling profiles.
	Profiles profile.Map

	client clientset.Interface
}


  • The scheduler cache exists mainly to avoid fetching NodeInfo on every scheduling attempt. Its structure is:
type schedulerCache struct {
	stop   <-chan struct{}
	ttl    time.Duration
	period time.Duration

	// This mutex guards all fields within this cache struct.
	mu sync.RWMutex
	// a set of assumed pod keys.
	// The key could further be used to get an entry in podStates.
	assumedPods map[string]bool
	// a map from pod key to podState.
	podStates map[string]*podState
	nodes     map[string]*nodeInfoListItem
	// headNode points to the most recently updated NodeInfo in "nodes". It is the
	// head of the linked list.
	headNode *nodeInfoListItem
	nodeTree *nodeTree
	// A map from image name to its imageState.
	imageStates map[string]*imageState
}


  • The PriorityQueue data structure:
type PriorityQueue struct {
	// PodNominator abstracts the operations to maintain nominated Pods.
	framework.PodNominator

	stop  chan struct{}
	clock util.Clock

	// pod initial backoff duration.
	podInitialBackoffDuration time.Duration
	// pod maximum backoff duration.
	podMaxBackoffDuration time.Duration

	lock sync.RWMutex
	cond sync.Cond

	// activeQ is heap structure that scheduler actively looks at to find pods to
	// schedule. Head of heap is the highest priority pod.
	activeQ *heap.Heap
	// podBackoffQ is a heap ordered by backoff expiry. Pods which have completed backoff
	// are popped from this heap before the scheduler looks at activeQ
	podBackoffQ *heap.Heap
	// unschedulableQ holds pods that have been tried and determined unschedulable.
	unschedulableQ *UnschedulablePodsMap
	// schedulingCycle represents sequence number of scheduling cycle and is incremented
	// when a pod is popped.
	schedulingCycle int64
	// moveRequestCycle caches the sequence number of scheduling cycle when we
	// received a move request. Unscheduable pods in and before this scheduling
	// cycle will be put back to activeQueue if we were trying to schedule them
	// when we received move request.
	moveRequestCycle int64

	// closed indicates that the queue is closed.
	// It is mainly used to let Pop() exit its control loop while waiting for an item.
	closed bool
}
