Relearning Computers (17: The Linux Scheduler and Scheduler Classes)


酱油师兄 2022-03-09 15:35:26 Blog category: Relearning Computers


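© Copyright belongs to the author: this is an original work by 酱油师兄 on the 51CTO blog. Please contact the author for permission before reposting; otherwise legal liability will be pursued.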

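I did not expect the previous post to cover nothing but priorities. This one tries to sort out the overall architecture of the Linux scheduler, and the next one formally starts on CFS, the Completely Fair Scheduler.

17.1 Overall framework

I still prefer going from the whole to the details. Presenting the big picture now may be confusing, but with an overall concept in mind we can refine the analysis step by step, and when we come back to the overall framework after the analysis, it will look very clear.

[Figure: overview of the scheduler components]

This figure is copied from "Professional Linux Kernel Architecture" (《深入linux内核架构》). The first time I saw it I was confused: why do the main scheduler and the periodic scheduler sit above the scheduler classes, and how exactly do the scheduler classes relate to the main scheduler and the periodic scheduler?

The relationship between scheduler classes and processes is fairly obvious, because a Linux system has several scheduler classes: the CFS (Completely Fair Scheduler) class, a real-time scheduler class, and an idle class that schedules the idle task when there is nothing else to do. Normal processes all belong to the CFS class, and processes with real-time requirements naturally belong to the real-time class.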

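17.2 The schedulers

Next we analyze how the scheduler is implemented. It is built on the two functions shown in the figure above: the periodic scheduler function and the main scheduler function. Let us look at each:

17.2.1 The periodic scheduler

The periodic scheduler is the simpler of the two. It is implemented in scheduler_tick, and while the system is active the kernel calls this function automatically at frequency HZ. The details may come up later if we get the chance (quite possibly they will not, since going that low-level does not buy us much).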
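To get a feel for what "at frequency HZ" means in time, here is a small worked sketch. HZ is a kernel build-time constant (commonly 100, 250 or 1000); the helper name tick_period_ns is made up for this post and is not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Length of one scheduler tick in nanoseconds for a given HZ value.
 * tick_period_ns() is a hypothetical helper, not part of the kernel. */
static uint64_t tick_period_ns(unsigned int hz)
{
    assert(hz > 0);               /* HZ is always a positive constant */
    return 1000000000ULL / hz;    /* one second split into hz ticks */
}
```

With HZ=250, for example, tick_period_ns(250) gives 4000000, so scheduler_tick would run roughly every 4 ms on a busy CPU.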

Let us go straight to the source:

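// kernel/sched/core.c
// There are many details here that I do not know either. Back in the day I kept digging into kernel details and got lost right away, so this time I avoid them until the overall picture is mostly in place.
/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 */
void scheduler_tick(void)
{
    int cpu = smp_processor_id();   // ID of the CPU this code is running on
    struct rq *rq = cpu_rq(cpu);    // the run queue of that CPU
    struct task_struct *curr = rq->curr;    // the task_struct of the task currently running on this run queue

    sched_clock_tick();     // probably adjusts some clock; ignored here

    raw_spin_lock(&rq->lock);    // take the run queue lock; kernel locking is left for later
    update_rq_clock(rq);        // fairly important: updates the run queue clock. We will meet the run queue again; it is where the runnable tasks live
    curr->sched_class->task_tick(rq, curr, 0);      // the key line: the task_struct holds a pointer to its scheduler class, and we call that class's task_tick. If the task is scheduled by CFS, this is the CFS task_tick. Each scheduler class handles the tick differently, hence the indirect call
    update_cpu_load_active(rq);   // updates the run queue's cpu_load[] array (honestly, I did not follow this one)
    calc_global_load_tick(rq);
    raw_spin_unlock(&rq->lock);   // release the lock

    perf_event_task_tick();

    // On multi-core systems: if a core is fairly idle, trigger load balancing
#ifdef CONFIG_SMP
    rq->idle_balance = idle_cpu(cpu);
    trigger_load_balance(rq);
#endif
    rq_last_tick_reset(rq);     // records when this run queue last saw a tick (only relevant with CONFIG_NO_HZ_FULL)
}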

Although I did not understand more than half of this code, the core is still visible:

curr->sched_class->task_tick(rq, curr, 0);

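The call goes through the process to its scheduler class, and the scheduler class then does the class-specific work. When we get to CFS we will come back and see what this method actually does.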
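The indirect call can be tried out in user space. What follows is only a toy sketch of the function-pointer dispatch: every toy_ name is invented for illustration, and the two tick handlers are placeholders, not what the real CFS or real-time classes do.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task;

/* Hypothetical, cut-down stand-in for struct sched_class. */
struct toy_sched_class {
    const char *name;
    void (*task_tick)(struct toy_task *curr);
};

struct toy_task {
    const struct toy_sched_class *sched_class;  /* which class owns this task */
    unsigned int ticks_seen;
    unsigned int slice_left;   /* only used by the round-robin toy class */
};

/* Toy "fair" class: just counts the tick. */
static void toy_fair_tick(struct toy_task *curr)
{
    curr->ticks_seen++;
}

/* Toy "round-robin" class: counts the tick and burns one unit of slice. */
static void toy_rr_tick(struct toy_task *curr)
{
    curr->ticks_seen++;
    if (curr->slice_left > 0)
        curr->slice_left--;
}

static const struct toy_sched_class toy_fair_class = { "fair", toy_fair_tick };
static const struct toy_sched_class toy_rr_class = { "rr", toy_rr_tick };

/* Same shape as the key line in scheduler_tick(): the caller does not
 * know or care which class the current task belongs to. */
static void toy_scheduler_tick(struct toy_task *curr)
{
    assert(curr != NULL && curr->sched_class != NULL);
    curr->sched_class->task_tick(curr);
}
```

Calling toy_scheduler_tick() on a task whose sched_class is &toy_rr_class decrements its slice_left, while a task pointing at &toy_fair_class only has its tick counted. The caller is identical in both cases, which is the whole point of the indirection.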

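17.2.2 The main scheduler

Analyzing the periodic scheduler just now felt pleasantly simple, but do not celebrate too early: the main scheduler is not going to be that easy.

The periodic scheduler is triggered by the timer because its main job is to check how long the process has been running and whether it has used up its time slice. If it has, the scheduler class reacts accordingly (this does not mean the process is preempted on the spot; that will become clear when we analyze CFS).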
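The split between "notice on the tick" and "act later" can be sketched like this. It is a deliberately simplified model under my own assumptions: the fixed slice_ticks, the toy_cpu struct and both function names are invented, and CFS decides when a task has run long enough in a different way.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_cpu {
    unsigned int ran_ticks;    /* ticks the current task has run so far */
    unsigned int slice_ticks;  /* its (made-up) time slice, in ticks */
    bool need_resched;         /* stands in for TIF_NEED_RESCHED */
    unsigned int switches;     /* how many task switches happened */
};

/* Periodic path: only accounts time and raises the flag. */
static void toy_tick(struct toy_cpu *cpu)
{
    assert(cpu->slice_ticks > 0);
    cpu->ran_ticks++;
    if (cpu->ran_ticks >= cpu->slice_ticks)
        cpu->need_resched = true;
}

/* A later safe point (for example, return to user space): act on the flag. */
static void toy_resched_point(struct toy_cpu *cpu)
{
    if (!cpu->need_resched)
        return;
    cpu->need_resched = false;
    cpu->ran_ticks = 0;        /* the next task starts a fresh slice */
    cpu->switches++;
}
```

With slice_ticks set to 3, the third toy_tick() only sets need_resched; the switch counter moves when toy_resched_point() runs afterwards.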

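The main scheduler is called from many more places, for example when the current process gives up the CPU voluntarily, or when the reschedule flag is checked. I tried searching the kernel source for callers of the main scheduler, got a huge pile of hits, and gave up; we will see what turns up when we get to CFS.

Enough chatter, here is the code: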

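//#define __sched       __attribute__((__section__(".sched.text")))
// As covered in an earlier chapter, this is a custom section defined through a gcc attribute. It gathers all scheduling code in the .sched.text section so that scheduler-related frames can be skipped when stack information is shown, which is why we do not see them in call traces.
asmlinkage __visible void __sched schedule(void)
{
    struct task_struct *tsk = current;
    // current is interesting and heavily used in the kernel: it always refers to the task_struct of the running task. It is really a per-architecture macro rather than a plain global variable; to keep access fast, some architectures hold it in a register and others in a per-CPU variable. More on that in the task_struct chapter.

    sched_submit_work(tsk);   // if the task is about to block, submits its plugged block I/O first to avoid deadlocks
    do {
        preempt_disable();      // preempt_count is a counter: 0 means preemption is allowed, non-zero means it is not. This call adds 1 to preempt_count
        __schedule(false);      // the real scheduling function, analyzed in detail below
        sched_preempt_enable_no_resched();   // subtracts 1 from preempt_count again
    } while (need_resched());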
}
EXPORT_SYMBOL(schedule);
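The counter logic around __schedule() can be modelled in a few lines. This is a toy sketch with invented toy_ names; the real preempt_count also encodes interrupt state and the real preemptible() check looks at more than this counter.

```c
#include <assert.h>
#include <stdbool.h>

static int toy_preempt_count;   /* 0 means preemption is allowed */

/* Disabling nests: every call adds one level. */
static void toy_preempt_disable(void)
{
    toy_preempt_count++;
}

/* Drops one level without checking whether a reschedule is pending. */
static void toy_preempt_enable_no_resched(void)
{
    assert(toy_preempt_count > 0);   /* an unbalanced enable is a bug */
    toy_preempt_count--;
}

static bool toy_preemptible(void)
{
    return toy_preempt_count == 0;
}
```

Two nested toy_preempt_disable() calls need two matching enables before toy_preemptible() returns true again, which is why the disable and enable calls in schedule() come as a pair.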

Let us concentrate on __schedule. Even if we cannot follow all of it, getting the general idea is enough.

/*
 * __schedule() is the main scheduler function.
 *
 * The main means of driving the scheduler and thus entering this function are:
 *   1. Explicit blocking: mutex, semaphore, waitqueue, etc.
 *
 *   2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
 *      paths. For example, see arch/x86/entry_64.S.
 *
 *      To drive preemption between tasks, the scheduler sets the flag in timer
 *      interrupt handler scheduler_tick().
 *
 *   3. Wakeups don't really cause entry into schedule(). They add a
 *      task to the run-queue and that's it.
 *
 *      Now, if the new task added to the run-queue preempts the current
 *      task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
 *      called on the nearest possible occasion:
 *
 *       - If the kernel is preemptible (CONFIG_PREEMPT=y):
 *         (that is, when kernel preemption is compiled in)
 *
 *         - in syscall or exception context, at the next outmost
 *           preempt_enable(). (this might be as soon as the wake_up()'s
 *           spin_unlock()!)
 *
 *         - in IRQ context, return from interrupt-handler to
 *           preemptible context
 *
 *       - If the kernel is not preemptible (CONFIG_PREEMPT is not set)
 *         then at the next:
 *
 *          - cond_resched() call
 *          - explicit schedule() call
 *          - return from syscall or exception to user-space
 *          - return from interrupt-handler to user-space
 *
 * WARNING: must be called with preemption disabled!
 *          (schedule() already did that with preempt_disable() before calling us)
 */
static void __sched notrace __schedule(bool preempt)
{
   struct task_struct *prev, *next;
   unsigned long *switch_count;
   struct rq *rq;
   int cpu;

   cpu = smp_processor_id();    // the CPU we are running on; finally some multi-core handling
   rq = cpu_rq(cpu);
   rcu_note_context_switch();
   prev = rq->curr;

   /*
    * do_exit() calls schedule() with preemption disabled as an exception;
    * however we must fix that up, otherwise the next task will see an
    * inconsistent (higher) preempt count.
    *
    * It also avoids the below schedule_debug() test from complaining
    * about this.
    */
    // In other words: if the task is dead, the preempt counter has to be decremented by one
   if (unlikely(prev->state == TASK_DEAD))
      preempt_enable_no_resched_notrace();

    // various schedule()-time debugging checks and statistics (so the kernel comment says)
   schedule_debug(prev);

   if (sched_feat(HRTICK))
      hrtick_clear(rq);

   /*
    * Make sure that signal_pending_state()->signal_pending() below
    * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
    * done by the caller to avoid the race with signal_wake_up().
    */
   smp_mb__before_spinlock();
   raw_spin_lock_irq(&rq->lock);
   lockdep_pin_lock(&rq->lock);

   rq->clock_skip_update <<= 1; /* promote REQ to ACT */

   switch_count = &prev->nivcsw;
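   if (!preempt && prev->state) {   // interesting check: preempt is a parameter (false on this path) and prev->state is the task state, where non-zero means "not runnable". So a task that is being preempted never takes this branch and stays on the run queue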
      if (unlikely(signal_pending_state(prev->state, prev))) {
         prev->state = TASK_RUNNING;
      } else {
            // the previous task is no longer runnable, so it has to be removed from the run queue
         deactivate_task(rq, prev, DEQUEUE_SLEEP);
         prev->on_rq = 0;

         /*
          * If a worker went to sleep, notify and ask workqueue
          * whether it wants to wake up a task to maintain
          * concurrency.
          */
         if (prev->flags & PF_WQ_WORKER) {
            struct task_struct *to_wakeup;

            to_wakeup = wq_worker_sleeping(prev, cpu);
            if (to_wakeup)
               try_to_wake_up_local(to_wakeup);
         }
      }
      switch_count = &prev->nvcsw;
   }

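   if (task_on_rq_queued(prev))
      update_rq_clock(rq);    // refresh the run queue clock

    // and here the next task is picked already; that was quick
   next = pick_next_task(rq, prev);
   clear_tsk_need_resched(prev);    // clear the need-resched flag
   clear_preempt_need_resched();
   rq->clock_skip_update = 0;       // reset the clock-skip state set above

    // if the chosen task differs from the previous one, a context switch is needed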
   if (likely(prev != next)) {
      rq->nr_switches++;
      rq->curr = next;
      ++*switch_count;

      trace_sched_switch(preempt, prev, next);  // emits the sched_switch tracepoint
        // the famous context switch
      rq = context_switch(rq, prev, next); /* unlocks the rq */
      cpu = cpu_of(rq);
   } else {
      lockdep_unpin_lock(&rq->lock);
      raw_spin_unlock_irq(&rq->lock);
   }

    // run any queued balancing callbacks (load balancing across CPUs)
   balance_callback(rq);
}
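The switch_count pointer above selects between two per-task counters: nvcsw for a task that blocked on its own and nivcsw for one that was preempted. User space can watch counters of this kind through getrusage(2), whose ru_nvcsw and ru_nivcsw fields report voluntary and involuntary context switches. As far as I can tell these are fed from the same kernel counters on Linux, but treat that link as my assumption. The function name below is made up for this sketch.

```c
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() */
#include <assert.h>
#include <sys/resource.h>
#include <time.h>

/* Returns how many voluntary context switches this process made while
 * sleeping `naps` times for one millisecond each, or -1 on error.
 * voluntary_switches_while_sleeping() is a hypothetical helper. */
static long voluntary_switches_while_sleeping(int naps)
{
    struct rusage before, after;
    struct timespec one_ms = { 0, 1000000L };

    assert(naps >= 0);
    if (getrusage(RUSAGE_SELF, &before) != 0)
        return -1;
    for (int i = 0; i < naps; i++)
        nanosleep(&one_ms, NULL);   /* blocks, so the task gives up the CPU */
    if (getrusage(RUSAGE_SELF, &after) != 0)
        return -1;
    return after.ru_nvcsw - before.ru_nvcsw;
}
```

On a typical Linux system each nap should show up as at least one voluntary switch, because sleeping takes the !preempt branch with a non-zero task state. The exact number depends on the system, so the sketch only reports the difference.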

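It feels like I caught the core but something is still missing, probably the details. My skills are not there yet, though, so I will not lose myself in details; once you start chasing them, you get stuck. Next, let us look at pick_next_task, the function that selects the next process.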

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev)
{
    const struct sched_class *class = &fair_sched_class;
    struct task_struct *p;

    /*
     * Optimization: we know that if all tasks are in
     * the fair class we can call that function directly:
     */
    // fast path: the previous task is a CFS task and every runnable task on this queue is in the CFS class
    if (likely(prev->sched_class == class &&
           rq->nr_running == rq->cfs.h_nr_running)) {
        p = fair_sched_class.pick_next_task(rq, prev);
        if (unlikely(p == RETRY_TASK))
            goto again;

        /* assumes fair_sched_class->next == idle_sched_class */
        // if even the CFS class found nothing to run, fall back to the idle class
        if (unlikely(!p))
            p = idle_sched_class.pick_next_task(rq, prev);

        return p;
    }

again:
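    // walk the scheduler classes; the macro is shown further down
    for_each_class(class) {
        p = class->pick_next_task(rq, prev);  // ask the class for a task; pick_next_task here is the class's own method, analyzed in detail in the next post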
        if (p) {
            if (unlikely(p == RETRY_TASK))
                goto again;
            return p;
        }
    }

    BUG(); /* the idle class will always have a runnable task */
}

Let us look directly at the code of for_each_class:

#define sched_class_highest (&stop_sched_class)
#define for_each_class(class) \
for (class = sched_class_highest; class; class = class->next)

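for_each_class walks the Linux scheduler classes in priority order, checks each class for a runnable process, and schedules the first one found. Here are the scheduler classes, listed from highest to lowest priority:

stop_sched_class (stop class)
dl_sched_class (deadline scheduling class)
rt_sched_class (real-time scheduling class)
fair_sched_class (CFS, completely fair scheduling class)
idle_sched_class (idle scheduling class)

That mostly covers the main scheduler. Many details were skipped, but the core is clear: it searches the scheduler classes in priority order for a suitable process and then switches to it. The specifics will be analyzed together with the individual scheduler classes.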
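The priority walk can be reproduced with a plain linked list. This is a toy sketch: toy_class and toy_pick_class are invented names, and a simple nr_runnable counter stands in for each class's real pick_next_task logic.

```c
#include <assert.h>
#include <stddef.h>

struct toy_class {
    const char *name;
    int nr_runnable;                 /* runnable tasks in this class */
    const struct toy_class *next;    /* next lower-priority class */
};

/* Walk the classes from highest to lowest priority and return the
 * first one that has something to run. */
static const struct toy_class *toy_pick_class(const struct toy_class *highest)
{
    const struct toy_class *class;

    assert(highest != NULL);
    for (class = highest; class; class = class->next)
        if (class->nr_runnable > 0)
            return class;
    return NULL;    /* not reached while an always-runnable idle class ends the list */
}
```

Chained as stop, dl, rt, fair, idle, the walk returns the fair class when only fair and idle have runnable tasks, switches to rt as soon as rt gets one, and falls through to idle when everything else is empty.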

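17.3 Scheduler classes

I had originally planned to cover fork and context switching here, but fork will wait until CFS is done, and context switching will wait until I am more experienced. Roughly speaking, a context switch saves and restores the registers and switches the address space and the stack. I hope there will still be a chance to analyze it later.

17.3.1 The scheduler class abstraction

Now let us look at how scheduler classes are abstracted. C is not an object-oriented language, but the Linux kernel is full of object-oriented thinking, and the scheduler class is one example. Here is the abstraction: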

struct sched_class {
    const struct sched_class *next;

    // add a task to the run queue
    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
    // remove a task from the run queue
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
    // the task wants to give up the CPU voluntarily
    void (*yield_task) (struct rq *rq);
    // yield the CPU in favor of a specific task p
    bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);

    // check whether a newly woken task should preempt the current one
    void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

    /*
     * It is the responsibility of the pick_next_task() method that will
     * return the next task to call put_prev_task() on the @prev task or
     * something equivalent.
     *
     * May return RETRY_TASK when it finds a higher prio class has runnable
     * tasks.
     */
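    // select the next task to run; this is what we analyzed above
    struct task_struct * (*pick_next_task) (struct rq *rq,
                        struct task_struct *prev);
    // bookkeeping for the task that is about to leave the CPU, done before the context switch
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP  // CONFIG_SMP: multi-processor support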
    int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
    void (*migrate_task_rq)(struct task_struct *p);

    void (*task_waking) (struct task_struct *task);
    void (*task_woken) (struct rq *this_rq, struct task_struct *task);

    void (*set_cpus_allowed)(struct task_struct *p,
                 const struct cpumask *newmask);

    void (*rq_online)(struct rq *rq);
    void (*rq_offline)(struct rq *rq);
#endif

    void (*set_curr_task) (struct rq *rq);
    // this is the one the periodic scheduler calls
    void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
    void (*task_fork) (struct task_struct *p);
    void (*task_dead) (struct task_struct *p);

    /*
     * The switched_from() call is allowed to drop rq->lock, therefore we
     * cannot assume the switched_from/switched_to pair is serliazed by
     * rq->lock. They are however serialized by p->pi_lock.
     */
    void (*switched_from) (struct rq *this_rq, struct task_struct *task);
    void (*switched_to) (struct rq *this_rq, struct task_struct *task);
    void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
                 int oldprio);

    unsigned int (*get_rr_interval) (struct rq *rq,
                     struct task_struct *task);

    void (*update_curr) (struct rq *rq);

#ifdef CONFIG_FAIR_GROUP_SCHED
    void (*task_move_group) (struct task_struct *p);
#endif
};

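Each scheduler class provides its own implementations of these methods, and the periodic scheduler or the main scheduler calls them to carry out process scheduling.

As analyzed above, there are currently five scheduler classes, each with its own implementation. The next post analyzes CFS, the Completely Fair Scheduler.

17.3.2 The run queue

The central data structure the main scheduler uses to manage active processes is the run queue. Each CPU has its own run queue, and each active process appears on exactly one run queue at a time.

Let us have a look at the run queue structure as well: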

/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
  /* runqueue lock: */
  raw_spinlock_t lock;

  /*
   * nr_running and cpu_load should be in the same cacheline because
   * remote CPUs use both these fields when doing load calculation.
   */
  unsigned int nr_running;  // number of runnable tasks on this queue

  #define CPU_LOAD_IDX_MAX 5
  unsigned long cpu_load[CPU_LOAD_IDX_MAX]; // tracks the load history
  unsigned long last_load_update_tick;

  /* capture load from *all* tasks on this cpu: */
  struct load_weight load;    // a measure of the current load on this run queue
  unsigned long nr_load_updates;
  u64 nr_switches;

  struct cfs_rq cfs;    // embedded per-class sub-run-queues
  struct rt_rq rt;
  struct dl_rq dl;


  /*
   * This is part of a global counter where only the total sum
   * over all CPUs matters. A task can increase this counter on
   * one CPU and if it got migrated afterwards it may decrease
   * it on another CPU. Always updated under the runqueue lock:
   */
  unsigned long nr_uninterruptible;

  struct task_struct *curr, *idle, *stop;
  unsigned long next_balance;
  struct mm_struct *prev_mm;

  unsigned int clock_skip_update;
  u64 clock;
  u64 clock_task;

  atomic_t nr_iowait;
};

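I left out more than half of this struct and still cannot make sense of what remains, so it will have to wait for later posts. For now it is enough to have seen it once.

17.3.3 The scheduling entity

What the Linux kernel actually schedules is a scheduling entity, struct sched_entity, which is embedded in each task_struct: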

struct sched_entity {
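    // this is where the priority computed in the previous post ends up: a load weight
    struct load_weight  load;       /* for load-balancing */
    struct rb_node      run_node;  // red-black tree node, used for ordering
    struct list_head    group_node;
    unsigned int        on_rq;      // whether this entity is currently queued on a run queue

    // these time fields are analyzed in the next post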
    u64         exec_start;
    u64         sum_exec_runtime;
    u64         vruntime;
    u64         prev_sum_exec_runtime;

    u64         nr_migrations;

#ifdef CONFIG_SCHEDSTATS
    struct sched_statistics statistics;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    int         depth;
    struct sched_entity *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq       *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq       *my_q;
#endif

#ifdef CONFIG_SMP
    /* Per entity load average tracking */
    struct sched_avg    avg;
#endif
};
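Because run_node is embedded inside the entity, code that walks the red-black tree only ever holds a pointer to the node and has to work its way back to the enclosing struct. The kernel does this with container_of() (rb_entry() is a thin wrapper around it). Below is a user-space sketch of the same trick; all toy_ names are invented and toy_node merely stands in for struct rb_node.

```c
#include <assert.h>
#include <stddef.h>

/* User-space version of the container_of() idea: subtract the member's
 * offset from the member's address to get the enclosing struct. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_node {                 /* stands in for struct rb_node */
    struct toy_node *left, *right;
};

struct toy_entity {               /* stands in for struct sched_entity */
    unsigned long weight;
    struct toy_node run_node;     /* embedded tree node */
    unsigned long long vruntime;
};

/* Given a pointer to the embedded node, recover the enclosing entity. */
static struct toy_entity *toy_entity_of(struct toy_node *node)
{
    assert(node != NULL);
    return toy_container_of(node, struct toy_entity, run_node);
}
```

Given a struct toy_entity se, toy_entity_of(&se.run_node) returns &se again, so a tree made only of nodes still gives access to fields such as vruntime.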