Relearning Computers (19. The CFS Completely Fair Scheduler <Part 2: Periodic Scheduling and New-Process Scheduling>)


酱油师兄 · 2022-03-03 13:47:42 · Blog category: 重学计算机 (Relearning Computers)


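© Copyright belongs to the author: this is an original work by 酱油师兄 on the 51CTO blog. Contact the author for permission before reposting; unauthorized reposting may incur legal liability.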

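The previous article covered how virtual time is computed and how the red-black tree is manipulated in the CFS completely fair scheduler. In this one we start reading the scheduler code itself.

19.1 The CFS scheduling class implementation

I almost skipped this. Let's first look at how the CFS scheduling class is implemented. As we saw earlier, the scheduler core abstracts a set of interfaces, and each scheduling class supplies its own implementation of them. Since we are now analyzing CFS, this is the time to look at its implementation.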

// kernel/sched/fair.c

/*
 * All the scheduling class methods:
 */
const struct sched_class fair_sched_class = {
  .next     = &idle_sched_class,         // next (lower-priority) scheduling class
  .enqueue_task   = enqueue_task_fair,     // add a task to the run queue
  .dequeue_task   = dequeue_task_fair,  // remove a task from the run queue
  .yield_task   = yield_task_fair,    
  .yield_to_task    = yield_to_task_fair,

  .check_preempt_curr = check_preempt_wakeup,   // check whether the current task should be preempted

  .pick_next_task   = pick_next_task_fair,     // pick the next task to run
  .put_prev_task    = put_prev_task_fair,  

#ifdef CONFIG_SMP
  .select_task_rq   = select_task_rq_fair,
  .migrate_task_rq  = migrate_task_rq_fair,

  .rq_online    = rq_online_fair,
  .rq_offline   = rq_offline_fair,

  .task_waking    = task_waking_fair,
  .task_dead    = task_dead_fair,
  .set_cpus_allowed = set_cpus_allowed_common,
#endif

  .set_curr_task          = set_curr_task_fair,
  .task_tick    = task_tick_fair,       // the function called by the periodic scheduler
  .task_fork    = task_fork_fair,

  .prio_changed   = prio_changed_fair,
  .switched_from    = switched_from_fair,
  .switched_to    = switched_to_fair,

  .get_rr_interval  = get_rr_interval_fair,

  .update_curr    = update_curr_fair,

#ifdef CONFIG_FAIR_GROUP_SCHED
  .task_move_group  = task_move_group_fair,
#endif
};

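The commented entries are the important ones and are covered in detail below. The rest are mostly skipped: the kernel has a huge amount of detail, and frankly I don't understand all of it yet. I'll analyze them when I have time or understand them better. The goal this time is a first exploration of the kernel, a first acquaintance.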
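To make the "table of function pointers" idea concrete, here is a minimal user-space sketch. Every name in it (toy_sched_class, toy_task, toy_scheduler_tick and so on) is hypothetical and only mirrors the shape of the kernel structure; it is not kernel code.

```c
#include <stddef.h>

struct toy_task;

/* Hypothetical, heavily reduced stand-in for struct sched_class. */
struct toy_sched_class {
    const struct toy_sched_class *next;     /* next lower-priority class */
    const char *name;
    int (*task_tick)(struct toy_task *t);   /* per-class tick hook */
};

struct toy_task {
    const struct toy_sched_class *sched_class;
    int ticks;                              /* ticks seen by the fair hook */
};

static int toy_tick_fair(struct toy_task *t) { return ++t->ticks; }
static int toy_tick_idle(struct toy_task *t) { (void)t; return 0; }

static const struct toy_sched_class toy_idle_class = {
    .next = NULL, .name = "idle", .task_tick = toy_tick_idle,
};
static const struct toy_sched_class toy_fair_class = {
    .next = &toy_idle_class, .name = "fair", .task_tick = toy_tick_fair,
};

/* The core only calls through the pointer and never names a class. */
static int toy_scheduler_tick(struct toy_task *curr)
{
    return curr->sched_class->task_tick(curr);
}

/* Count how many classes sit below `c` in the priority chain. */
static int toy_classes_below(const struct toy_sched_class *c)
{
    int n = 0;

    for (c = c->next; c; c = c->next)
        n++;
    return n;
}
```

A task whose sched_class points at toy_fair_class gets toy_tick_fair() on every toy_scheduler_tick() call, which is the same indirection the real scheduler core uses to reach task_tick_fair().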

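19.2 Periodic scheduling

Next, periodic scheduling. In Relearning Computers (17. Linux Schedulers and Scheduling Classes) we saw that the kernel periodically calls scheduler_tick(), and we analyzed that function. This time the focus is on this line: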

curr->sched_class->task_tick(rq, curr, 0);

This calls the task_tick function pointer of the task's scheduling class. For CFS, the function behind it is:

task_tick_fair()

Let's analyze it.

19.2.1 The periodic function task_tick_fair()

/*
 * scheduler tick hitting a task of our scheduling class:
 * rq: the CPU's run queue
 * curr: the current task
 * Called as: curr->sched_class->task_tick(rq, curr, 0);
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
    struct cfs_rq *cfs_rq;
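    struct sched_entity *se = &curr->se;    // scheduling entity of the current task

    // walk up the entity hierarchy (more than one level only with group scheduling)
    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);     // CFS run queue this entity belongs to
        entity_tick(cfs_rq, se, queued);  // per-entity tick handling, the key function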
    }

    if (static_branch_unlikely(&sched_numa_balancing))
        task_tick_numa(rq, curr);           // NUMA balancing tick hook, not covered here
}

19.2.2 The tick handler entity_tick()

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    // update the running entity's vruntime and statistics (analyzed in detail in the previous article)
    update_curr(cfs_rq);

    /*
     * Ensure that runnable average is periodically updated.
     */
    update_load_avg(curr, 1);
    update_cfs_shares(cfs_rq);

#ifdef CONFIG_SCHED_HRTICK    // high-resolution tick
    /*
     * queued ticks are scheduled to match the slice, so don't bother
     * validating it and just reschedule.
     */
    if (queued) {
        resched_curr(rq_of(cfs_rq));
        return;
    }
    /*
     * don't let the period tick interfere with the hrtick preemption
     */
    if (!sched_feat(DOUBLE_TICK) &&
            hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
        return;
#endif

    // if more than one entity is runnable, check whether the current one should be preempted
    if (cfs_rq->nr_running > 1)
        check_preempt_tick(cfs_rq, curr);    // decide whether the current entity must be preempted
}

19.2.3 Deciding whether to preempt the current task: check_preempt_tick()

/*
 * Preempt the current task with a newly woken task if needed:
 * cfs_rq: the CFS run queue being serviced
 * curr: the currently running entity
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    unsigned long ideal_runtime, delta_exec;
    struct sched_entity *se;
    s64 delta;

    /* ideal_runtime is how long the entity should run in this period; sched_slice() was analyzed earlier */
    ideal_runtime = sched_slice(cfs_rq, curr);      // ideal time slice for this entity
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;   // real time run since it was last picked
    // has the entity already run longer than its slice?
    if (delta_exec > ideal_runtime) {   // both sides are real run times, not virtual times
        /* mark the task with TIF_NEED_RESCHED */
        resched_curr(rq_of(cfs_rq));    // only sets the reschedule flag, see resched_curr() below
        /*
         * The current task ran long enough, ensure it doesn't get
         * re-elected due to buddy favours.
         */
        clear_buddies(cfs_rq, curr);
        return;
    }

    /*
     * Ensure that a task that missed wakeup preemption by a
     * narrow margin doesn't have to wait for a full slice.
     * This also mitigates buddy induced latencies under load.
     */
     /* if the current entity has run for less than the minimum granularity, do not preempt it */
    if (delta_exec < sysctl_sched_min_granularity)
        return;

    // this is where the leftmost entity comes in; I overlooked it on my first read
    se = __pick_first_entity(cfs_rq);       // entity with the smallest vruntime in the tree
    delta = curr->vruntime - se->vruntime;  // how far the current entity is ahead of it

    if (delta < 0)  // the current entity still has the smaller vruntime: keep running
        return;
    // another entity has a smaller vruntime and the gap exceeds the current entity's ideal slice: reschedule
    if (delta > ideal_runtime)  // ideal_runtime is the ideal slice, not the time actually run
        resched_curr(rq_of(cfs_rq));
}
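The decision logic above can be replayed outside the kernel. The following is a minimal sketch, assuming a hypothetical tick_state snapshot that holds the handful of values check_preempt_tick() reads (all in nanoseconds); it models only the decision, not the buddy handling or the flag setting.

```c
#include <stdint.h>

/* Hypothetical snapshot of the values check_preempt_tick() looks at (ns). */
struct tick_state {
    uint64_t ideal_runtime;      /* slice computed by sched_slice() */
    uint64_t delta_exec;         /* real time run since last picked */
    uint64_t min_granularity;    /* sysctl_sched_min_granularity */
    uint64_t curr_vruntime;      /* vruntime of the running entity */
    uint64_t leftmost_vruntime;  /* vruntime of the leftmost entity */
};

/* Returns 1 if the tick should mark the current task for rescheduling. */
static int tick_should_resched(const struct tick_state *s)
{
    int64_t delta;

    if (s->delta_exec > s->ideal_runtime)
        return 1;                /* slice used up */
    if (s->delta_exec < s->min_granularity)
        return 0;                /* ran too briefly to be preempted */

    delta = (int64_t)(s->curr_vruntime - s->leftmost_vruntime);
    if (delta < 0)
        return 0;                /* current still has the smallest vruntime */
    return delta > (int64_t)s->ideal_runtime;
}
```

For example, with a 4 ms slice and a 0.75 ms minimum granularity, a task that has run 1 ms is rescheduled only if it is more than 4 ms of vruntime ahead of the leftmost entity.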

19.2.4 Setting the reschedule flag: resched_curr()

/*
 * resched_curr - mark rq's current task 'to be rescheduled now'.
 *
 * On UP this means the setting of the need_resched flag, on SMP it
 * might also involve a cross-CPU call to trigger the scheduler on
 * the target CPU.
 */
void resched_curr(struct rq *rq)
{
  struct task_struct *curr = rq->curr;
  int cpu;

  lockdep_assert_held(&rq->lock);

  if (test_tsk_need_resched(curr))    // TIF_NEED_RESCHED is already set: nothing to do
    return;

  cpu = cpu_of(rq);

  if (cpu == smp_processor_id()) {
    set_tsk_need_resched(curr);
    set_preempt_need_resched();
    return;
  }

  if (set_nr_and_not_polling(curr))
    smp_send_reschedule(cpu);
  else
    trace_sched_wake_idle_without_ipi(cpu);
}

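Periodic scheduling only checks whether the task has used up its time slice. Even when it has, no switch happens right away; the code merely sets the TIF_NEED_RESCHED flag. Later we need to watch carefully for where the actual switch takes place. Spoiler: it happens in the main scheduler.

19.2.5 Summary

The summary is simply a diagram of the call flow.

[Figure: call-flow diagram of periodic scheduling in CFS]

One summary diagram makes this easier to remember, unlike the first time, when I forgot everything right after writing it.

19.3 Adding a new process

So far we covered the periodic scheduler. Now let's see how a newly forked process is handed to the scheduler. I had not planned an analysis of fork itself, which is a bit of a loss; I'll fill that in after this topic. For today we only pick out the scheduler-related code in fork.

In do_fork(), two calls are related to the scheduler:

/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);   // set up scheduler-related fields

wake_up_new_task(p);    // the second one: wake up the new process

Let's go through them one at a time.

19.3.1 Initializing scheduler parameters: sched_fork()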

/*
 * fork()/clone()-time setup:
 */
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
    unsigned long flags;
    int cpu = get_cpu();

    // initialize scheduling fields such as the scheduling entity, run time and virtual run time
    __sched_fork(clone_flags, p);
    /*
     * We mark the process as running here. This guarantees that
     * nobody will actually run it, and a signal or other external
     * event cannot wake it up and insert it on the runqueue either.
     */
    p->state = TASK_RUNNING;

    /*
     * Make sure we do not leak PI boosting priority to the child.
     */
    p->prio = current->normal_prio;     // the child takes the parent's normal priority, not a PI-boosted one

    /*
     * Revert to default priority/policy on fork if requested.
     */
     // reset the child to the default policy and priority when sched_reset_on_fork is set
    if (unlikely(p->sched_reset_on_fork)) {
        if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
            p->policy = SCHED_NORMAL;
            p->static_prio = NICE_TO_PRIO(0);
            p->rt_priority = 0;
        } else if (PRIO_TO_NICE(p->static_prio) < 0)
            p->static_prio = NICE_TO_PRIO(0);

        p->prio = p->normal_prio = __normal_prio(p);
        set_load_weight(p);     // derive the load weight from the priority

        /*
         * We don't need the reset flag anymore after the fork. It has
         * fulfilled its duty:
         */
        p->sched_reset_on_fork = 0;
    }

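    if (dl_prio(p->prio)) {             // a child with deadline priority is refused
        put_cpu();
        return -EAGAIN;
    } else if (rt_prio(p->prio)) {      // pick the scheduling class from the priority
        p->sched_class = &rt_sched_class;
    } else {
        p->sched_class = &fair_sched_class;     // not real-time: use the completely fair class
    }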
    }

    // the class was just assigned above; now call its task_fork hook
    if (p->sched_class->task_fork)
        p->sched_class->task_fork(p);

    /*
     * The child is not yet in the pid-hash so no cgroup attach races,
     * and the cgroup is pinned to this child due to cgroup_fork()
     * is ran before sched_fork().
     *
     * Silence PROVE_RCU.

     */
     // place the child on the current CPU for now; wake_up_new_task() may pick another CPU later
    raw_spin_lock_irqsave(&p->pi_lock, flags);
    set_task_cpu(p, cpu);
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);

#ifdef CONFIG_SCHED_INFO
    if (likely(sched_info_on()))
        memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP)
    p->on_cpu = 0;
#endif
    init_task_preempt_count(p);     // appears to be a no-op on x86
#ifdef CONFIG_SMP
    plist_node_init(&p->pushable_tasks, MAX_PRIO);
    RB_CLEAR_NODE(&p->pushable_dl_tasks);
#endif

    put_cpu();
    return 0;
}

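In short: sched_fork() calls __sched_fork() for the basic initialization, sets up the priority, selects the scheduling class from that priority, and then calls the class hook p->sched_class->task_fork(p).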
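The class selection at the end of sched_fork() depends only on the numeric priority. Below is a minimal sketch of that mapping; the TOY_ names are hypothetical, and the constants are assumed to mirror the kernel's MAX_DL_PRIO (0), MAX_RT_PRIO (100) and DEFAULT_PRIO (120) for kernels of this era.

```c
#include <errno.h>

/* Assumed to mirror MAX_DL_PRIO, MAX_RT_PRIO and NICE_TO_PRIO(). */
#define TOY_MAX_DL_PRIO 0
#define TOY_MAX_RT_PRIO 100
#define TOY_NICE_TO_PRIO(nice) ((nice) + 120)

enum toy_class { TOY_CLASS_NONE, TOY_CLASS_RT, TOY_CLASS_FAIR };

/*
 * Mimics the tail of sched_fork(): a deadline priority makes the fork
 * fail with -EAGAIN, a real-time priority selects the rt class, and
 * everything else goes to the fair class.
 */
static int toy_pick_class(int prio, enum toy_class *out)
{
    if (prio < TOY_MAX_DL_PRIO) {
        *out = TOY_CLASS_NONE;
        return -EAGAIN;
    }
    *out = (prio < TOY_MAX_RT_PRIO) ? TOY_CLASS_RT : TOY_CLASS_FAIR;
    return 0;
}
```

An ordinary nice-0 task has priority TOY_NICE_TO_PRIO(0) = 120, so it lands in the fair class, which is why task_fork_fair() is the hook that runs next for most processes.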

19.3.2 Basic initialization: __sched_fork()

/*
 * Perform scheduler related setup for a newly forked process p.
 * p is forked by current.
 *
 * __sched_fork() is basic setup used by init_idle() too:
 */
static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
{
  // reset everything to zero
  p->on_rq      = 0;

  p->se.on_rq     = 0;
  p->se.exec_start    = 0;
  p->se.sum_exec_runtime    = 0;
  p->se.prev_sum_exec_runtime = 0;
  p->se.nr_migrations   = 0;
  p->se.vruntime      = 0;
  INIT_LIST_HEAD(&p->se.group_node);

#ifdef CONFIG_SCHEDSTATS
  memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif

  // initialize the deadline entity
  RB_CLEAR_NODE(&p->dl.rb_node);
  init_dl_task_timer(&p->dl);
  __dl_clear_params(p);

  // initialize the rt entity
  INIT_LIST_HEAD(&p->rt.run_list);

#ifdef CONFIG_PREEMPT_NOTIFIERS
  INIT_HLIST_HEAD(&p->preempt_notifiers);   // preempt notifiers
#endif

// NUMA balancing setup, skipped here
#ifdef CONFIG_NUMA_BALANCING
  if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
    p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
    p->mm->numa_scan_seq = 0;
  }

  if (clone_flags & CLONE_VM)
    p->numa_preferred_nid = current->numa_preferred_nid;
  else
    p->numa_preferred_nid = -1;

  p->node_stamp = 0ULL;
  p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
  p->numa_scan_period = sysctl_numa_balancing_scan_delay;
  p->numa_work.next = &p->numa_work;
  p->numa_faults = NULL;
  p->last_task_numa_placement = 0;
  p->last_sum_exec_runtime = 0;

  p->numa_group = NULL;
#endif /* CONFIG_NUMA_BALANCING */
}

Nothing special here; it just initializes a number of fields.

19.3.3 CFS initialization of a new process: (*task_fork)()

Next, let's see how CFS initializes a new process.

/*
 * called on fork with the child task as argument from the parent's context
 *  - child not yet on the tasklist
 *  - preemption disabled
 */
static void task_fork_fair(struct task_struct *p)   // p is the child
{
  struct cfs_rq *cfs_rq;
  struct sched_entity *se = &p->se, *curr;
  int this_cpu = smp_processor_id();
  struct rq *rq = this_rq();    // run queue of this CPU
  unsigned long flags;

  raw_spin_lock_irqsave(&rq->lock, flags);

  update_rq_clock(rq);

  cfs_rq = task_cfs_rq(current);
  curr = cfs_rq->curr;    // entity currently running on this cfs_rq, i.e. the parent

  /*
   * Not only the cpu but also the task_group of the parent might have
   * been changed after parent->se.parent,cfs_rq were copied to
   * child->se.parent,cfs_rq. So call __set_task_cpu() to make those
   * of child point to valid ones.
   */
  rcu_read_lock();
  __set_task_cpu(p, this_cpu);
  rcu_read_unlock();

  /* update the parent's run-time statistics */
  update_curr(cfs_rq);   // this function again; it matters

  // the child starts from the parent's vruntime
  if (curr)
    se->vruntime = curr->vruntime;    // se is the child, curr is the parent
  place_entity(cfs_rq, se, 1);      // computes the child's initial vruntime; important

  /* if child-runs-first is enabled and the parent's vruntime is smaller than the child's, swap them so the child runs first */
  if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
    /*
     * Upon rescheduling, sched_class::put_prev_task() will place
     * 'current' within the tree based on its new key value.
     */
    swap(curr->vruntime, se->vruntime);   // swap the two vruntimes
    resched_curr(rq);     // set the reschedule flag
  }

  /* subtract min_vruntime here; it is added back when the entity is enqueued */
  se->vruntime -= cfs_rq->min_vruntime;   // the reason becomes clear in wake_up_new_task() and enqueue_entity()

  raw_spin_unlock_irqrestore(&rq->lock, flags);
}

19.3.4 Placing the child's virtual time: place_entity()

// initial=1: a newly created process; initial=0: a process being woken up
// se: the child
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
    u64 vruntime = cfs_rq->min_vruntime;  // minimum vruntime of the CFS run queue

    /*
     * The 'current' period is already promised to the current tasks,
     * however the extra weight of the new task will slow them down a
     * little, place the new task so that it fits in the slot that
     * stays open at the end.
     */
     // START_DEBIT charges a newly created process one virtual slice up front
    if (initial && sched_feat(START_DEBIT))
        vruntime += sched_vslice(cfs_rq, se);

    /* sleeps up to a single latency don't count. */
    // not executed for a newly created process
    if (!initial) {
        unsigned long thresh = sysctl_sched_latency;

        /*
         * Halve their sleep time's effect, to allow
         * for a gentler effect of sleepers:
         */
        if (sched_feat(GENTLE_FAIR_SLEEPERS))
            thresh >>= 1;

        vruntime -= thresh;
    }

    /* ensure we never gain time by being placed backwards. */
    se->vruntime = max_vruntime(se->vruntime, vruntime);
}

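Without the START_DEBIT feature, the child's virtual run time is:

se->vruntime = max_vruntime(parent's vruntime, min_vruntime of the CFS run queue);

Both values are comparatively small, so the child gets scheduled soon.

Where can we check whether the feature is set?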

root@ubuntu:~# cat /sys/kernel/debug/sched_features 
GENTLE_FAIR_SLEEPERS <START_DEBIT> NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK LB_BIAS NONTASK_CAPACITY TTWU_QUEUE RT_PUSH_IPI NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD 
root@ubuntu:~#

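It can be read from debugfs under /sys/kernel/debug, and on my system the feature is clearly enabled.

What if START_DEBIT is set?

We analyzed sched_vslice() earlier: it computes the entity's virtual time slice. So the resulting virtual run time is:

se->vruntime = max_vruntime(parent's vruntime, min_vruntime of the CFS run queue + the entity's virtual time slice);

In this case the child is scheduled a little later.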
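The two cases can be compared with a small calculation. This is a sketch only: child_initial_vruntime() is a hypothetical helper that folds the fork path of place_entity() into one function, and the vslice value is passed in rather than computed by sched_vslice().

```c
#include <stdint.h>

/* Wrap-safe maximum of two vruntime values (same idea as max_vruntime()). */
static uint64_t toy_max_vruntime(uint64_t max, uint64_t v)
{
    if ((int64_t)(v - max) > 0)
        max = v;
    return max;
}

/*
 * Hypothetical helper: initial vruntime of a forked child.
 *   parent_vruntime  what task_fork_fair() copied into the child
 *   min_vruntime     cfs_rq->min_vruntime
 *   vslice           the child's virtual slice, assumed precomputed
 *   start_debit      non-zero when the START_DEBIT feature is on
 */
static uint64_t child_initial_vruntime(uint64_t parent_vruntime,
                                       uint64_t min_vruntime,
                                       uint64_t vslice,
                                       int start_debit)
{
    uint64_t vruntime = min_vruntime;

    if (start_debit)
        vruntime += vslice;     /* push the child one slice to the right */
    return toy_max_vruntime(parent_vruntime, vruntime);
}
```

With a parent at 1000, a queue minimum of 900 and a virtual slice of 300, the child starts at 1000 without START_DEBIT and at 1200 with it, so it sits further right in the tree and waits a little longer.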

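19.3.5 Waking the new process: wake_up_new_task()

We now know how the new process is initialized and how its virtual time is set. Next, let's see how the child is inserted into the red-black tree so that the scheduler can find it.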

/*
 * wake_up_new_task - wake up a newly created task for the first time.
 *
 * This function will do some initial scheduler statistics housekeeping
 * that must be done for every newly created context, then puts the task
 * on the runqueue and wakes it.
 */
void wake_up_new_task(struct task_struct *p)
{
  unsigned long flags;
  struct rq *rq;

  raw_spin_lock_irqsave(&p->pi_lock, flags);
  /* Initialize new task's runnable average */
  init_entity_runnable_average(&p->se);
#ifdef CONFIG_SMP
  /*
   * Fork balancing, do it here and not earlier because:
   *  - cpus_allowed can change in the fork path
   *  - any previously selected cpu might disappear through hotplug
   */
   // before the child is queued, pick a suitable CPU for it (fork-time load balancing)
  set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));

  // min_vruntime can differ between the run queues of different CPUs. If the child kept
  // an absolute vruntime from a queue with a small min_vruntime, it would gain a large,
  // unfair head start on the new queue. That is why task_fork_fair() did
  // se->vruntime -= cfs_rq->min_vruntime; with the old queue's minimum, and why the
  // new queue's min_vruntime is added back on enqueue after the migration.
#endif

  rq = __task_rq_lock(p);
  activate_task(rq, p, 0);   // the key step: enqueue the task
  p->on_rq = TASK_ON_RQ_QUEUED; // queued on the run queue, not migrating
  trace_sched_wakeup_new(p);
  check_preempt_curr(rq, p, WF_FORK);   // also key: check whether the current task should be preempted
#ifdef CONFIG_SMP
  if (p->sched_class->task_woken) {
    /*
     * Nothing relies on rq->lock after this, so its fine to
     * drop it.
     */
    lockdep_unpin_lock(&rq->lock);
    p->sched_class->task_woken(rq, p);
    lockdep_pin_lock(&rq->lock);
  }
#endif
  task_rq_unlock(rq, p, &flags);
}

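This function does one round of load balancing before the child enters the red-black tree of runnable entities, which is the best moment for it. It then adds the child to the tree inside activate_task(), and finally calls check_preempt_curr() to decide whether to preempt.

19.3.6 Preparing to enqueue: activate_task()

I couldn't think of a better title, so this one it is.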

void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
  if (task_contributes_to_load(p))
    rq->nr_uninterruptible--;

  enqueue_task(rq, p, flags);   // enqueue
}

This function is simple; it just calls enqueue_task().

19.3.7 Enqueueing: enqueue_task()

static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
  update_rq_clock(rq);
  if (!(flags & ENQUEUE_RESTORE))
    sched_info_queued(rq, p);
  p->sched_class->enqueue_task(rq, p, flags);   // the familiar function pointer again
}

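This function does not do much either; it mainly calls the enqueue_task hook of the scheduling class, which brings us back to CFS.

19.3.8 Enqueueing in CFS: enqueue_task_fair()

Now for the actual enqueue operation.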

/*
 * The enqueue_task method is called before nr_running is
 * increased. Here we update the fair scheduling stats and
 * then put the task into the rbtree:
 */
static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
  struct cfs_rq *cfs_rq;
  struct sched_entity *se = &p->se;

  for_each_sched_entity(se) {
    if (se->on_rq)      // already on a run queue: stop here
      break;
    cfs_rq = cfs_rq_of(se);
    enqueue_entity(cfs_rq, se, flags);    // enqueue this entity

    /*
     * end evaluation on encountering a throttled cfs_rq
     *
     * note: in the case of encountering a throttled cfs_rq we will
     * post the final h_nr_running increment below.
    */
    if (cfs_rq_throttled(cfs_rq))   // CFS bandwidth control, not covered here
      break;
    cfs_rq->h_nr_running++;

    flags = ENQUEUE_WAKEUP;
  }

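  // se is not reset: this loop picks up where the first one stopped and only
  // updates counters and load for the remaining ancestors
  for_each_sched_entity(se) {
    cfs_rq = cfs_rq_of(se);
    cfs_rq->h_nr_running++;

    if (cfs_rq_throttled(cfs_rq))
      break;

    update_load_avg(se, 1);
    update_cfs_shares(cfs_rq);
  }

  if (!se)
    add_nr_running(rq, 1);    // se == NULL: the whole hierarchy was walked without hitting a throttled queue

  hrtick_update(rq);    // high-resolution tick, skipped here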
}

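Adding a node to the red-black tree is involved: many functions have been called already and we are still not done. I'll draw the chain in the summary to see how many functions take part.

19.3.9 Inserting into the red-black tree: enqueue_entity()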

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
  /*
   * Update the normalized vruntime before updating min_vruntime
   * through calling update_curr().
   */
   // I traced flags on this path and it is 0, so this condition holds
   // and the min_vruntime subtracted earlier is added back
  if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
    se->vruntime += cfs_rq->min_vruntime;   // restores what task_fork_fair() subtracted

  /*
   * Update run-time statistics of the 'current'.
   */
  update_curr(cfs_rq);    // update the run-time accounting again; this function is called a lot
  enqueue_entity_load_avg(cfs_rq, se);
  account_entity_enqueue(cfs_rq, se);
  update_cfs_shares(cfs_rq);

  if (flags & ENQUEUE_WAKEUP) {     // set when the entity is being woken up
    place_entity(cfs_rq, se, 0);
    enqueue_sleeper(cfs_rq, se);
  }

  update_stats_enqueue(cfs_rq, se);
  check_spread(cfs_rq, se);
  if (se != cfs_rq->curr)
    __enqueue_entity(cfs_rq, se);   // the red-black tree insertion, analyzed in the previous article
  se->on_rq = 1;    // the entity is now on the run queue

  if (cfs_rq->nr_running == 1) {
    list_add_leaf_cfs_rq(cfs_rq);
    check_enqueue_throttle(cfs_rq);
  }
}

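The important point: this function adds back the value subtracted earlier and then inserts the entity into the red-black tree, where it waits for CFS to pick it.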
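The subtract-then-add-back trick is easy to check with numbers. This is a minimal sketch with hypothetical helper names; it only shows why a relative vruntime survives a move between run queues whose min_vruntime values differ.

```c
#include <stdint.h>

/* At fork: make the child's vruntime relative to the source queue. */
static uint64_t toy_vruntime_normalize(uint64_t vruntime, uint64_t src_min)
{
    return vruntime - src_min;
}

/* At enqueue: make it absolute again on the destination queue. */
static uint64_t toy_vruntime_denormalize(uint64_t rel, uint64_t dst_min)
{
    return rel + dst_min;
}

/* Lead of an entity over the queue minimum (negative means head start). */
static int64_t toy_lead(uint64_t vruntime, uint64_t queue_min)
{
    return (int64_t)(vruntime - queue_min);
}
```

Suppose the child was placed at 1200 on a queue whose minimum is 1000, and fork balancing then moves it to a queue whose minimum is 50000. The normalized value is 200 and the enqueued value is 50200, so the child keeps its lead of 200. Had the absolute 1200 been carried over, it would appear 48800 units behind everyone on the new queue and would monopolize the CPU until it caught up.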

19.3.10 Checking whether to preempt: check_preempt_curr()

void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
  const struct sched_class *class;

  // p is in the same scheduling class as the running task: let that class decide
  if (p->sched_class == rq->curr->sched_class) {
    rq->curr->sched_class->check_preempt_curr(rq, p, flags);
  } else {
    for_each_class(class) {
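      // the running task's class comes first, so it outranks p's class: no preemption
      if (class == rq->curr->sched_class)
        break;
      // p's class comes first, so it outranks the running task's class: set need_resched
      if (class == p->sched_class) {
        resched_curr(rq);    // set the flag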
        break;
      }
    }
  }

  /*
   * A queue event has occurred, and we're going to schedule.  In
   * this case, we can save a useless back to back clock update.
   */
  if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
    rq_clock_skip_update(rq, true);
}

19.3.11 The preemption hook (*check_preempt_curr)()

.check_preempt_curr = check_preempt_wakeup,

In CFS this function pointer is set to check_preempt_wakeup(). Let's see how CFS decides on a switch.

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
  struct task_struct *curr = rq->curr;
  struct sched_entity *se = &curr->se, *pse = &p->se;
  struct cfs_rq *cfs_rq = task_cfs_rq(curr);
  int scale = cfs_rq->nr_running >= sched_nr_latency;
  int next_buddy_marked = 0;

  if (unlikely(se == pse))
    return;

  /*
   * This is possible from callers such as attach_tasks(), in which we
   * unconditionally check_prempt_curr() after an enqueue (which may have
   * lead to a throttle).  This both saves work and prevents false
   * next-buddy nomination below.
   */
  if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
    return;

  if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
    set_next_buddy(pse);
    next_buddy_marked = 1;
  }

  /*
   * We can come here with TIF_NEED_RESCHED already set from new task
   * wake up path.
   *
   * Note: this also catches the edge-case of curr being in a throttled
   * group (e.g. via set_curr_task), since update_curr() (in the
   * enqueue of curr) will have resulted in resched being set.  This
   * prevents us from potentially nominating it as a false LAST_BUDDY
   * below.
   */
  if (test_tsk_need_resched(curr))    // flag already set: nothing more to do
    return;

  /* Idle tasks are by definition preempted by non-idle tasks. */
  if (unlikely(curr->policy == SCHED_IDLE) &&    // current uses the SCHED_IDLE policy: preempt directly
      likely(p->policy != SCHED_IDLE))
    goto preempt;

  /*
   * Batch and idle tasks do not preempt non-idle tasks (their preemption
   * is driven by the tick):
   */
  if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
    return;

  find_matching_se(&se, &pse);
  update_curr(cfs_rq_of(se));   // update the accounting once more
  BUG_ON(!pse);
  if (wakeup_preempt_entity(se, pse) == 1) {    // the key decision: should pse preempt se?
    /*
     * Bias pick_next to pick the sched entity that is
     * triggering this preemption.
     */
    if (!next_buddy_marked)
      set_next_buddy(pse);
    goto preempt;   // condition met: jump to where the flag is set
  }

  return;

preempt:
  resched_curr(rq);   // set the flag
  /*
   * Only set the backward buddy when the current task is still
   * on the rq. This can happen when a wakeup gets interleaved
   * with schedule on the ->pre_schedule() or idle_balance()
   * point, either of which can * drop the rq lock.
   *
   * Also, during early boot the idle thread is in the fair class,
   * for obvious reasons its a bad idea to schedule back to it.
   */
  if (unlikely(!se->on_rq || curr == rq->idle))
    return;

  if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
    set_last_buddy(se);
}

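This function mainly handles wakeups: it is called both when a new process is added and when a sleeping task wakes up. Despite the name, what it really does is set the flag that requests a task switch.

Whether the flag gets set is decided by wakeup_preempt_entity().

19.3.12 Deciding whether to switch: wakeup_preempt_entity()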

/*
 * Should 'se' preempt 'curr'.
 *
 *             |s1
 *        |s2
 *   |s3
 *         g
 *      |<--->|c
 *
 *  w(c, s1) = -1
 *  w(c, s2) =  0
 *  w(c, s3) =  1
 *
 */
static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
  s64 gran, vdiff = curr->vruntime - se->vruntime;   

  if (vdiff <= 0)  // curr->vruntime <= se->vruntime: no preemption
    return -1;

  gran = wakeup_gran(curr, se); // the threshold actually used in the comparison
  if (vdiff > gran) // the vruntime gap exceeds the wakeup granularity (in virtual time): preempt
    return 1;

  return 0;
}

This is still not the last function in the chain; kernel code really is deeply layered.
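The three outcomes in the s1/s2/s3 picture can be reproduced with plain numbers. This is a sketch: toy_wakeup_preempt() is a hypothetical stand-in that takes the granularity as a parameter instead of calling wakeup_gran().

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for wakeup_preempt_entity().
 * Returns -1 if the woken entity is not behind curr at all,
 *          0 if it is behind by no more than gran,
 *          1 if it is behind by more than gran (preempt).
 */
static int toy_wakeup_preempt(uint64_t curr_vruntime,
                              uint64_t se_vruntime,
                              uint64_t gran)
{
    int64_t vdiff = (int64_t)(curr_vruntime - se_vruntime);

    if (vdiff <= 0)
        return -1;
    if (vdiff > (int64_t)gran)
        return 1;
    return 0;
}
```

With curr at 1000 and a granularity of 200, an entity at 1500 (s1) gives -1, one at 900 (s2) gives 0, and one at 700 (s3) gives 1. Only the last case makes check_preempt_wakeup() jump to its preempt label.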

19.3.13 Computing the threshold: wakeup_gran()

static unsigned long
wakeup_gran(struct sched_entity *curr, struct sched_entity *se)
{
  unsigned long gran = sysctl_sched_wakeup_granularity;   // so there is a wakeup preemption granularity

  /*
   * Since its curr running now, convert the gran from real-time
   * to virtual-time in his units.
   *
   * By using 'se' instead of 'curr' we penalize light tasks, so
   * they get preempted easier. That is, if 'se' < 'curr' then
   * the resulting gran will be larger, therefore penalizing the
   * lighter, if otoh 'se' > 'curr' then the resulting gran will
   * be smaller, again penalizing the lighter task.
   *
   * This is especially important for buddies when the leftmost
   * task is higher priority than the buddy.
   */
  return calc_delta_fair(gran, se);   // wakeup granularity converted to virtual time
}

This function brings in one more parameter, the wakeup preemption granularity sysctl_sched_wakeup_granularity. It can also be read through the proc file system:

root@ubuntu:~# cat /proc/sys/kernel/sched_wakeup_granularity_ns 
2000000

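The unit is ns, so this is 2 ms. With a wakeup granularity of 2 ms on my system, a newly woken task will not necessarily preempt right away.

If wakeup preemption happens too often, adjust sched_wakeup_granularity_ns: the larger the value, the harder it is to preempt. Note: sched_wakeup_granularity_ns should not exceed half of the scheduling period sched_latency_ns; otherwise wakeup preemption is effectively disabled.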
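To see how the granularity depends on the woken task's weight, here is a rough model of the conversion done by calc_delta_fair(). It is a sketch under stated assumptions: the toy_ names are hypothetical, NICE_0_LOAD is taken to be 1024, and a plain integer division replaces the kernel's multiply-and-shift arithmetic, so results may differ from the kernel's by rounding.

```c
#include <stdint.h>

#define TOY_NICE_0_LOAD 1024ULL  /* assumed weight of a nice-0 task */

/* Convert real time to virtual time for an entity of the given weight. */
static uint64_t toy_calc_delta_fair(uint64_t delta_ns, uint64_t weight)
{
    /* weight is assumed to be non-zero */
    if (weight != TOY_NICE_0_LOAD)
        delta_ns = delta_ns * TOY_NICE_0_LOAD / weight;
    return delta_ns;
}

/* Wakeup granularity in virtual time, scaled by the woken task's weight. */
static uint64_t toy_wakeup_gran(uint64_t gran_ns, uint64_t woken_weight)
{
    return toy_calc_delta_fair(gran_ns, woken_weight);
}
```

With the 2 ms value shown above, a woken nice-0 task (weight 1024) must be more than 2 ms of vruntime behind the current task to preempt it. A heavier woken task gets a smaller threshold and preempts more easily, while a lighter one gets a larger threshold, which matches the "penalize light tasks" remark in the kernel comment.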