Load balancing is a mechanism that distributes the task workload roughly evenly across CPUs, balancing power against performance. For example, if many tasks are packed onto cpu0 alone, power is wasted (the heavy load forces a higher CPU frequency, and the overall execution time grows) and tasks are not run promptly, causing jank as well as extra battery drain. Load balancing triggers task migration, distributing tasks across the CPUs according to a set of rules, which both saves power and improves performance.

Load balancing performs two main operations:

1. pull: a lightly loaded CPU pulls tasks over from a heavily loaded CPU. This is the primary form of load balancing, since a CPU that is already heavily loaded should not also run the balancing work itself.

2. push: a heavily loaded CPU pushes tasks to a lightly loaded CPU. This operation is also called active load balance.

 

 


Depending on the situation, load balancing falls roughly into the following kinds:

busy balance (periodic balance): the CPU has tasks running; decide whether a load balance is needed.

idle balance: the CPU has already entered idle, and the balance is done at the tick. Under NO_HZ, a nohz idle balance is attempted instead.

newly idle balance: the CPU has no runnable task and is about to enter idle. In this case the scheduler tries to pull some tasks over from other CPUs.

active load balance: after several failed load balance attempts, active balance is performed if the conditions are met. Tasks are pushed from the heavily loaded CPU to a lightly loaded CPU. (What gets pushed is the running task.)

nohz idle balance: checks whether the current CPU is overloaded and so on, and whether an idle CPU should be woken up to pull some tasks over and share the load. The wakeup uses IPI_RESCHEDULE to bring the idle CPU out of power collapse.

 

I originally planned to cover every balance trigger in one article, but there is too much material, so it has to be split. This article focuses on the load balance triggered periodically from the scheduler tick and its call flow; the rest will follow later.

The code in this article is based on CAF-kernel-4.19 and CAF-kernel-5.4. My understanding is still maturing and mistakes are inevitable, so corrections are welcome.

 

Periodic Load Balance

In scheduler_tick(), load balance is periodically checked for and triggered:

scheduler_tick()
{
...
#ifdef CONFIG_SMP
    rq->idle_balance = idle_cpu(cpu);
    trigger_load_balance(rq);       // trigger load balance
#endif   
... 
}

trigger_load_balance() raises a SOFTIRQ when a balance is due. The nohz_balancer_kick() call that follows serves nohz idle balance and will be analyzed later.

/*
 * Trigger the SCHED_SOFTIRQ if it is time to do periodic load balancing.
 */
void trigger_load_balance(struct rq *rq)
{
    /* Don't need to rebalance while attached to NULL domain or
     * cpu is isolated.
     */
    if (unlikely(on_null_domain(rq)) || cpu_isolated(cpu_of(rq)))
        return;

    if (time_after_eq(jiffies, rq->next_balance))
        raise_softirq(SCHED_SOFTIRQ);           // raise the SOFTIRQ

    nohz_balancer_kick(rq);         // part of nohz idle balance, not analyzed here
}

So where does the softirq land once it fires? First look at where this SOFTIRQ is registered: the fair scheduling class registers it during initialization, so raising it ends up invoking run_rebalance_domains().

__init void init_sched_fair_class(void)
{
#ifdef CONFIG_SMP
    open_softirq(SCHED_SOFTIRQ, run_rebalance_domains); // register the SOFTIRQ handler

#ifdef CONFIG_NO_HZ_COMMON
    nohz.next_balance = jiffies;
    nohz.next_blocked = jiffies;
    zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
#endif
#endif /* SMP */

}

So once the SOFTIRQ fires, the handler run_rebalance_domains() runs.

/*
 * run_rebalance_domains is triggered when needed from the scheduler tick.
 * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
 */
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
    struct rq *this_rq = this_rq();
    enum cpu_idle_type idle = this_rq->idle_balance ?
                        CPU_IDLE : CPU_NOT_IDLE;

    /*
     * Since core isolation doesn't update nohz.idle_cpus_mask, there
     * is a possibility this nohz kicked cpu could be isolated. Hence
     * return if the cpu is isolated.
     */
    if (cpu_isolated(this_rq->cpu))     // skip cpus that are already isolated
        return;

    /*
     * If this CPU has a pending nohz_balance_kick, then do the
     * balancing on behalf of the other idle CPUs whose ticks are
     * stopped. Do nohz_idle_balance *before* rebalance_domains to
     * give the idle CPUs a chance to load balance. Else we may
     * load balance only within the local sched_domain hierarchy
     * and abort nohz_idle_balance altogether if we pull some load.
     */
    if (nohz_idle_balance(this_rq, idle))   // balance on behalf of all cpus already in idle, since their scheduler ticks are stopped (analyzed later);
        return;                             // doing it before rebalance_domains gives the idle cpus a chance to balance first; otherwise we would only balance within the local sched_domain hierarchy, and nohz_idle_balance is aborted once we pull some load

    /* normal load balance */
    update_blocked_averages(this_rq->cpu);      // update the load of this cpu's rq and of the cfs rqs at every level inside it
    rebalance_domains(this_rq, idle);           // (1) check each sd level and balance where due
}

(1) rebalance_domains() walks the current cpu's sched domains (from the base sd up through sd->parent), and:

1. Ages each sd's max_newidle_lb_cost at roughly 1% per second, and sums the values of all sds into the rq's max_idle_balance_cost, which is floored at sysctl_sched_migration_cost (0.5ms). (I don't yet know what newidle_lb_cost is actually used for.) A small sketch of the decay arithmetic follows this list.

2. Decides for each sd level whether a load balance is due, and starts the balance accordingly. Traversal order: MC -> DIE.

3. The rq-level balance check deadline defaults to 60s in the future. If an sd's next balance check falls earlier than the rq's, the rq's check time is pulled in to the sd's (i.e. if the sd is earlier, the rq follows the sd's check time).

4. Each sd's check is due at sd->last_balance + sd->balance_interval (the higher the domain level, the longer the interval, because migrating tasks costs more).

5. Some conditions skip the load balance check for the current sd:

  --- the sd is not overutilized and prefer spread idle is not set (no overutil means the system is not seriously overloaded; prefer spread idle is a qcom feature that, when enabled, lets tasks migrate freely to idle cpus inside the current cluster, to even out the runnable task count)

  --- sd->flags lacks SD_LOAD_BALANCE (no load balance wanted)

  --- continue_balancing is 0 (another cpu in the sd is balancing more actively, so there is no need to duplicate the work)

6. Updates the rq's next balance time when needed; also, if the current cpu is idle and nohz.next_balance is later than rq->next_balance, sets nohz.next_balance = rq->next_balance.
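To make the ~1%/s decay in step 1 concrete, here is a minimal standalone sketch (not kernel code; the starting value is made up). Multiplying by 253/256 once per second removes about 1.2% each time.

#include <stdio.h>

/* Sketch of the max_newidle_lb_cost aging in rebalance_domains():
 * once per second the cost is scaled by 253/256 (~1.2% decay).
 * The initial value is hypothetical. */
int main(void)
{
    unsigned long long cost = 100000;      /* pretend cost, in ns */
    for (int sec = 1; sec <= 5; sec++) {
        cost = cost * 253 / 256;
        printf("after %ds: %llu\n", sec, cost);
    }
    return 0;
}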

/*
 * It checks each scheduling domain to see if it is due to be balanced,
 * and initiates a balancing operation if so.
 *
 * Balancing parameters are set up in init_sched_domains.
 */
static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
{
    int continue_balancing = 1;
    int cpu = rq->cpu;
    unsigned long interval;
    struct sched_domain *sd;
    /* Earliest time when we have to do rebalance again */
    unsigned long next_balance = jiffies + 60*HZ;        // deadline for the next rebalance: 60s from now
    int update_next_balance = 0;
    int need_serialize, need_decay = 0;
    u64 max_cost = 0;

    rcu_read_lock();
    for_each_domain(cpu, sd)  {        // walk this cpu's domains, from the base sd upward through the parents
        /*
         * Decay the newidle max times here because this is a regular
         * visit to all the domains. Decay ~1% per second.
         */
        if (time_after(jiffies, sd->next_decay_max_lb_cost)) {    // decay max_newidle_lb_cost by ~1% per second: max_newidle_lb_cost*253/256
            sd->max_newidle_lb_cost =
                (sd->max_newidle_lb_cost * 253) / 256;
            sd->next_decay_max_lb_cost = jiffies + HZ;        // HZ == 1 second
            need_decay = 1;
        }
        max_cost += sd->max_newidle_lb_cost;        // sum up the cost of all domains

        if (!sd_overutilized(sd) && !prefer_spread_on_idle(cpu))    // sd not overutilized and no prefer-spread-on-idle: skip, no load balance needed
            continue;

        if (!(sd->flags & SD_LOAD_BALANCE))        // no SD_LOAD_BALANCE flag means no load balance wanted; on this platform both MC and DIE levels carry it
            continue;

        /*
         * Stop the load balance at this level. There is another
         * CPU in our sched group which is doing load balancing more
         * actively.
         */
        if (!continue_balancing) {        // continue_balancing == 0: stop balancing at this level, another cpu in our sched group is balancing more actively
            if (need_decay)
                continue;
            break;
        }

        interval = get_sd_balance_interval(sd, idle != CPU_IDLE);        // this domain's balance interval; the higher the level, the longer, since migrating tasks costs more
        need_serialize = sd->flags & SD_SERIALIZE;        // serialize the balance? Only NUMA carries this flag, so it is 0 here
        if (need_serialize) {
            if (!spin_trylock(&balancing))       
                goto out;
        }

        if (time_after_eq(jiffies, sd->last_balance + interval)) {            // the domain's interval has elapsed: do the actual balance
            if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {    // (2) the core load balance function
                /*
                 * The LBF_DST_PINNED logic could have changed
                 * env->dst_cpu, so we can't know our idle
                 * state even if we migrated tasks. Update it.
                 */
                idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;    // re-read the cpu idle state
            }
            sd->last_balance = jiffies;        // record the balance timestamp
            interval = get_sd_balance_interval(sd, idle != CPU_IDLE);            // re-read the interval (it may have been tuned inside load_balance; if any cpu in the sd has a misfit task it is forced to 1 jiffy == 1 tick)
        }
        if (need_serialize)                            // if serialization was taken above,
            spin_unlock(&balancing);                // drop the lock here
out:
        if (time_after(next_balance, sd->last_balance + interval)) {                // if this sd's next balance time (sd->last_balance + interval) falls before the rq's (next_balance, default 60s out),
            next_balance = sd->last_balance + interval;                                // pull the rq's next balance time in to the sd's;
            update_next_balance = 1;                                                // the rq ends up aligned to the earliest balance time among its sds
        }
        }
    }
    if (need_decay) {
        /*
         * Ensure the rq-wide value also decays but keep it at a
         * reasonable floor to avoid funnies with rq->avg_idle.
         */
        rq->max_idle_balance_cost =
            max((u64)sysctl_sched_migration_cost, max_cost);            // max_idle_balance_cost is floored at sysctl_sched_migration_cost (0.5ms)
    }
    rcu_read_unlock();

    /*
     * next_balance will be updated only when there is a need.
     * When the cpu is attached to null domain for ex, it will not be
     * updated.
     */
    if (likely(update_next_balance)) {
        rq->next_balance = next_balance;            // update the rq's next balance time

#ifdef CONFIG_NO_HZ_COMMON
        /*
         * If this CPU has been elected to perform the nohz idle
         * balance. Other idle CPUs have already rebalanced with
         * nohz_idle_balance() and nohz.next_balance has been
         * updated accordingly. This CPU is now running the idle load
         * balance for itself and we need to update the
         * nohz.next_balance accordingly.
         */
        if ((idle == CPU_IDLE) && time_after(nohz.next_balance, rq->next_balance))    // this cpu is idle and nohz.next_balance is later than rq->next_balance:
            nohz.next_balance = rq->next_balance;            // move nohz's next balance time up
#endif
    }
}

(2) The load_balance() function

Main framework: (figure not reproduced)

Execution flow diagram: (figure not reproduced)

 

 

Code flow:

1. Initialize the load balance env structure: the current cpu becomes the light (dst) cpu that takes over some work from a heavily loaded cpu, and the sched group scope of the balance is fixed. env->cpus is the intersection of (1) the cpus in the sched domain and (2) the active cpus.

redo: (on a retry, the src cpu, i.e. busiest, is re-selected from here)

2. Check whether the current cpu should perform the balance at all — should_we_balance(); if not, clear *continue_balancing and goto out_balanced.

3. Find the busiest group and compute the imbalance value — find_busiest_group(); if no busiest group is found, goto out_balanced.

4. Find the rq of the busiest cpu within that group — find_busiest_queue(); if none is found, goto out_balanced.

5. Initialize the migrated-task counter: ld_moved = 0.

6. Check whether the busiest rq holds more than one task:

  • If it holds >1 task:
  • Initialize the flags and cap a single load balance pass at 32 task iterations.

more_balance:

  • Re-check whether balancing is still warranted: if the busiest rq has <=1 task, or <=2 tasks while under active balance, clear LBF_ALL_PINNED and goto no_move.
  • Refresh the busiest rq's clock.
  • Detach the tasks to be migrated from the busiest rq.
  • Attach the detached tasks to the dst rq, and add them to this balance's total count of migrated tasks.
  • If env.flags carries LBF_NEED_BREAK (one detach pass examined more tasks than the threshold while imbalance remains), clear it and goto more_balance.
  • If cpuset constraints pinned the dst cpu and a new dst cpu was selected while imbalance still remains: take the new dst cpu as the dst and goto more_balance to re-run the detach/attach.
  • If affinity prevented reaching balance, set the parent sd's sg type to group_imbalanced, effectively propagating the imbalance upward.
  • If affinity prevented migrating any task at all to the dst cpu:
  • Drop the dst cpu from the candidate set; if the remaining cpus are not all inside the dst cpu's group, one of them may still be busiest, so goto redo to re-select busiest.
  • If the remaining cpus can no longer be busiest, goto out_all_pinned.

no_move:

  • If the total number of migrated tasks (ld_moved) is still 0:
  • Bump the balance failure counter (periodic balance only).
  • Decide whether active load balance is needed (a more aggressive balance that pushes tasks off the busiest cpu).
  • If the src cpu or the busiest cpu is marked reserved, balancing cannot proceed: goto out.
  • If the busiest cpu's curr task is barred from the current cpu by cpuset; or curr has strict max boost enabled, is in an rtg, and the dst cpu's orig_capacity < the src cpu's orig_capacity: set env.flags |= LBF_ALL_PINNED and goto out_one_pinned.
  • If busiest is not already doing active balance and its cpu is not isolated: mark busiest as doing active balance (busiest->active_balance = 1), set the push target to the current cpu, note that the active balance worker must be woken (active_balance = 1), and mark the current cpu reserved.
  • Check whether the active balance worker should be woken (is active_balance == 1?).
  • If so, queue busiest's active_load_balance work (the handler active_load_balance_cpu_stop runs in the stop class) and set *continue_balancing = 0. Active load balance mainly migrates the running task; it will be covered in a later post, not detailed here.
  • Reaching this point means active balance was kicked, so set the sd's balance failure count to cache_nice_tries + 10 - 1.
  • If some tasks did migrate (ld_moved != 0), reset the failure counter to 0.

7. Depending on whether active load balance was actually kicked (see the interval back-off sketch after this list):

  • If it was not, we are still in an unbalanced state, so shrink the balance interval back to the minimum: sd->min_interval.
  • If it was, and the balance interval < sd->max_interval, double the interval (*2).

8. goto out.

out_balanced:

9. If there is a parent sd (i.e. we are at MC level) and it is not the case that every task was pinned (that is, despite some affinity conflicts, balance was reached), clear the parent sd's imbalance flag: group_imbalance = 0.

out_all_pinned:

10. Every task at this level is pinned by affinity, so none can be migrated; keep the parent sd's imbalance flag so the parent level can try. The balance count is also accounted here (broken down by the current cpu's idle state), and the failure counter is cleared.

out_one_pinned:

11. Reset the total migrated-task count: ld_moved = 0.

12. In the newly idle case, goto out directly.

13. If all tasks were pinned by affinity and the balance interval < MAX_PINNED_INTERVAL (512), or the balance interval < sd->max_interval, double the interval (*2).

out:

14. Return ld_moved, the total number of tasks migrated by this balance; the abnormal paths above return 0.
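A minimal standalone sketch of the interval handling described in steps 7 and 13 (the struct and its values are illustrative; in the kernel these fields live in struct sched_domain):

#include <stdio.h>

/* Sketch of the balance_interval back-off in load_balance();
 * the interval values are made up for illustration. */
struct sd_sketch {
    unsigned long balance_interval;     /* current interval, in jiffies */
    unsigned long min_interval;
    unsigned long max_interval;
};

static void tune_interval(struct sd_sketch *sd, int kicked_active_balance)
{
    if (!kicked_active_balance)
        sd->balance_interval = sd->min_interval;    /* still unbalanced: retry soon */
    else if (sd->balance_interval < sd->max_interval)
        sd->balance_interval *= 2;                  /* back off while active balance runs */
}

int main(void)
{
    struct sd_sketch sd = { .balance_interval = 8, .min_interval = 8, .max_interval = 32 };
    tune_interval(&sd, 1);      /* active balance kicked: 8 -> 16 */
    tune_interval(&sd, 0);      /* no active balance: back to min (8) */
    printf("interval = %lu\n", sd.balance_interval);
    return 0;
}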

/*
 * Check this_cpu to ensure it is balanced within domain. Attempt to move
 * tasks if there is an imbalance.
 */
static int load_balance(int this_cpu, struct rq *this_rq,
            struct sched_domain *sd, enum cpu_idle_type idle,
            int *continue_balancing)
{
    int ld_moved = 0, cur_ld_moved, active_balance = 0;
    struct sched_domain *sd_parent = sd->parent;
    struct sched_group *group = NULL;
    struct rq *busiest = NULL;
    struct rq_flags rf;
    struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);  // the only place this per-cpu mask is used, filled right here; it becomes env->cpus

    struct lb_env env = {                    // fill in the structure load balance works with
        .sd        = sd,                        // the sched domain being balanced: the current sd
        .dst_cpu    = this_cpu,                // migration target cpu: this cpu
        .dst_rq        = this_rq,                // migration target rq: this cpu's rq
        .dst_grpmask    = sched_group_span(sd->groups),  // span of the sg belonging to the sd being balanced
        .idle        = idle,                    // idle state of the migration target (i.e. this) cpu
        .loop_break    = sched_nr_migrate_break,
        .cpus        = cpus,
        .fbq_type    = all,
        .tasks        = LIST_HEAD_INIT(env.tasks),        // init the list that will hold the tasks to migrate
        .imbalance    = 0,
        .flags        = 0,
        .loop        = 0,
    };
                                                                    // prefer_spread setup:
    env.prefer_spread = (prefer_spread_on_idle(this_cpu) &&            // this cpu has prefer-spread-on-idle set, and the following does not hold:
                !((sd->flags & SD_ASYM_CPUCAPACITY) &&                // SD_ASYM_CPUCAPACITY is set and this cpu lies outside asym_cap_sibling_cpus
                 !cpumask_test_cpu(this_cpu,                        
                         &asym_cap_sibling_cpus)));

    cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);    // env->cpus = intersection of (1) cpus in the sched domain and (2) active cpus

    schedstat_inc(sd->lb_count[idle]);        // sd->lb_count[idle]++

redo:
    if (!should_we_balance(&env)) {            // (2-1) may the target cpu act as the light cpu for this balance? (a light cpu has little work and pulls tasks from heavily loaded cpus)
        *continue_balancing = 0;
        goto out_balanced;                    // if not, clear continue_balancing = 0 and skip the migration
    }

    group = find_busiest_group(&env);        // (2-2) if imbalance exists, find the most loaded (busiest) sched_group on this level's group list, and compute the load to migrate
    if (!group) {
        schedstat_inc(sd->lb_nobusyg[idle]);        // no busiest group found: account it and bail out
        goto out_balanced;
    }

    busiest = find_busiest_queue(&env, group);    // (2-3) find the rq of the most loaded cpu inside the busiest sched_group
    if (!busiest) {
        schedstat_inc(sd->lb_nobusyq[idle]);
        goto out_balanced;
    }

    BUG_ON(busiest == env.dst_rq);            // the migration target rq must never itself be the busiest one!

    schedstat_add(sd->lb_imbalance[idle], env.imbalance);        // accumulate per-idle-state scheduling stats: the imbalance value env.imbalance

    env.src_cpu = busiest->cpu;        // record the src cpu and src rq found by the long search above in the env structure
    env.src_rq = busiest;

    ld_moved = 0;
    if (busiest->nr_running > 1) {            // migration needs the busiest rq to hold more than one task
        /*
         * Attempt to move tasks. If find_busiest_group has found
         * an imbalance but busiest->nr_running <= 1, the group is
         * still unbalanced. ld_moved simply stays zero, so it is
         * correctly treated as an imbalance.
         */
        env.flags |= LBF_ALL_PINNED;        // initial assumption: every task on the busiest rq is pinned by affinity and cannot move
        env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);    // cap one balance pass at 32 tasks, since interrupts stay off throughout

more_balance:
        rq_lock_irqsave(busiest, &rf);

        /*
         * The world might have changed. Validate assumptions.
         * And also, if the busiest cpu is undergoing active_balance,
         * it doesn't need help if it has less than 2 tasks on it.
         */

        if (busiest->nr_running <= 1 ||                                    // busiest rq holds one task or none,
            (busiest->active_balance && busiest->nr_running <= 2)) {       // or busiest is in active balance with <= 2 tasks:
            rq_unlock_irqrestore(busiest, &rf);
            env.flags &= ~LBF_ALL_PINNED;                // clear LBF_ALL_PINNED
            goto no_move;                                // and bail out without migrating
        }
        }

        update_rq_clock(busiest);                    // refresh the busiest rq's clock

        /*
         * cur_ld_moved - load moved in current iteration (task count of this pass)
         * ld_moved     - cumulative load moved across iterations (total task count so far)
         */
        cur_ld_moved = detach_tasks(&env);            // (2-4) detach: take the tasks to migrate off their old rq; returns how many

        /*
         * We've detached some tasks from busiest_rq. Every
         * task is masked "TASK_ON_RQ_MIGRATING", so we can safely
         * unlock busiest->lock, and we are able to be sure
         * that nobody can manipulate the tasks in parallel.
         * See task_rq_lock() family for the details.
         */

        rq_unlock(busiest, &rf);

        if (cur_ld_moved) {                    // tasks were detached: attach them to the dst cpu to complete the migration
            attach_tasks(&env);                // (2-5) attach all the tasks to the new rq
            ld_moved += cur_ld_moved;          // add them to the running total of migrated tasks
        }

        local_irq_restore(rf.flags);

        if (env.flags & LBF_NEED_BREAK) {        // set during detach once the loop passes a threshold; on this platform the threshold equals the loop cap, so in effect there is no break
            env.flags &= ~LBF_NEED_BREAK;
            goto more_balance;
        }

        /*
         * Revisit (affine) tasks on src_cpu that couldn't be moved to
         * us and move them to an alternate dst_cpu in our sched_group
         * where they can run. The upper limit on how many times we
         * iterate on same src_cpu is dependent on number of CPUs in our
         * sched_group.
         *
         * This changes load balance semantics a bit on who can move
         * load to a given_cpu. In addition to the given_cpu itself
         * (or a ilb_cpu acting on its behalf where given_cpu is
         * nohz-idle), we now have balance_cpu in a position to move
         * load to given_cpu. In rare situations, this may cause
         * conflicts (balance_cpu and given_cpu/ilb_cpu deciding
         * _independently_ and at _same_ time to move some load to
         * given_cpu) causing excess load to be moved to given_cpu.
         * This however should not happen so much in practice and
         * moreover subsequent load balance cycles should correct the
         * excess load moved.
         */
        if ((env.flags & LBF_DST_PINNED) && env.imbalance > 0) {    // the dst cpu was pinned (e.g. by cpuset), a new dst cpu was selected, and imbalance remains: redo detach/attach with the new dst cpu

            /* Prevent to re-select dst_cpu via env's CPUs */
            cpumask_clear_cpu(env.dst_cpu, env.cpus);            // don't re-pick the cpuset-restricted cpu as dst: drop it from env.cpus

            env.dst_rq     = cpu_rq(env.new_dst_cpu);                // promote the new dst cpu to dst, and its rq to dst rq
            env.dst_cpu     = env.new_dst_cpu;
            env.flags    &= ~LBF_DST_PINNED;                        // drop the LBF_DST_PINNED flag
            env.loop     = 0;                                    // reset the loop
            env.loop_break     = sched_nr_migrate_break;

            /*
             * Go back to "more_balance" rather than "redo" since we
             * need to continue with same src_cpu.
             */
            goto more_balance;
        }

        /*
         * We failed to reach balance because of affinity.
         */
        if (sd_parent) {                            // there is a parent sd (we are at MC level)
            int *group_imbalance = &sd_parent->groups->sgc->imbalance;

            if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)      // imbalance was left behind because of cpuset:
                *group_imbalance = 1;                                        // set the parent sd's sgc->imbalance so the parent is more likely to rebalance: *group_imbalance = 1
        }

        /* All tasks on this runqueue were pinned by CPU affinity */
        if (unlikely(env.flags & LBF_ALL_PINNED)) {                            // every task on the busiest rq is pinned by affinity and cannot move (LBF_ALL_PINNED):
            cpumask_clear_cpu(cpu_of(busiest), cpus);                          // drop busiest's cpu from the candidate set
            /*
             * Attempting to continue load balancing at the current
             * sched_domain level only makes sense if there are
             * active CPUs remaining as possible busiest CPUs to
             * pull load from which are not contained within the
             * destination group that is receiving any migrated
             * load.
             */
            if (!cpumask_subset(cpus, env.dst_grpmask)) {        // some remaining cpus lie outside the dst cpu's group:
                env.loop = 0;                                    // reset the loop,
                env.loop_break = sched_nr_migrate_break;
                goto redo;                                        // and redo the whole balance decision
            }
            goto out_all_pinned;
        }
    }

no_move:
    if (!ld_moved) {                // after all those attempts ld_moved is still 0: the balance failed
        /*
         * Increment the failure counter only on periodic balance.
         * We do not want newidle balance, which can be very
         * frequent, pollute the failure counter causing
         * excessive cache_hot migrations and active balances.
         */
        if (idle != CPU_NEWLY_IDLE) {            // not newly idle,
            if (env.src_grp_nr_running > 1)        // and the src group's nr_running > 1:
                sd->nr_balance_failed++;        // count the failure
        }

        if (need_active_balance(&env)) {                // (2-6) decide whether active balance is needed
            unsigned long flags;

            raw_spin_lock_irqsave(&busiest->lock, flags);

            /*
             * The CPUs are marked as reserved if tasks
             * are pushed/pulled from other CPUs. In that case,
             * bail out from the load balancer.
             */
            if (is_reserved(this_cpu) ||                            // this cpu is reserved,
                    is_reserved(cpu_of(busiest))) {                 // or the busiest cpu is:
                raw_spin_unlock_irqrestore(&busiest->lock,
                                flags);
                *continue_balancing = 0;                            // abort the balance and clear the flag: *continue_balancing = 0
                goto out;
            }

            /*
             * Don't kick the active_load_balance_cpu_stop,
             * if the curr task on busiest CPU can't be
             * moved to this_cpu:
             */
            if (!cpumask_test_cpu(this_cpu,                            // this_cpu is not in busiest curr's cpuset,
                        &busiest->curr->cpus_allowed)
                || !can_migrate_boosted_task(busiest->curr,            // or busiest's curr has strict max boost on, is in an rtg, and the dst cpu's orig_capacity < the src cpu's orig_capacity,
                        cpu_of(busiest), this_cpu)) {                  // so busiest's curr task cannot migrate:
                raw_spin_unlock_irqrestore(&busiest->lock,
                                flags);
                env.flags |= LBF_ALL_PINNED;                        // set env.flags |= LBF_ALL_PINNED
                goto out_one_pinned;
            }

            /*
             * ->active_balance synchronizes accesses to
             * ->active_balance_work.  Once set, it's cleared
             * only after active load balance is finished.
             */
            if (!busiest->active_balance &&                // busiest is not already in active balance,
                !cpu_isolated(cpu_of(busiest))) {    // and busiest's cpu is not isolated:
                busiest->active_balance = 1;            // mark busiest's rq as doing active balance,
                busiest->push_cpu = this_cpu;            // set the push target to this cpu,
                active_balance = 1;                        // note that the active_balance work is to be kicked,
                mark_reserved(this_cpu);                // and mark this cpu reserved
            }
            raw_spin_unlock_irqrestore(&busiest->lock, flags);

            if (active_balance) {
                stop_one_cpu_nowait(cpu_of(busiest),                        // queue the active_load_balance work (the handler runs in the stop class);
                    active_load_balance_cpu_stop, busiest,                    // the work function is active_load_balance_cpu_stop
                    &busiest->active_balance_work);
                *continue_balancing = 0;
            }

            /* We've kicked active balancing, force task migration. */
            sd->nr_balance_failed = sd->cache_nice_tries +                // active balance was kicked: set the failure count to cache_nice_tries + 10 - 1
                    NEED_ACTIVE_BALANCE_THRESHOLD - 1;
        }
    } else
        sd->nr_balance_failed = 0;        // some tasks migrated: clear the failure count

    if (likely(!active_balance)) {            // active balance was not kicked, or has already completed
        /* We were unbalanced, so reset the balancing interval */
        sd->balance_interval = sd->min_interval;        // reset the balance interval to min_interval
    } else {
        /*
         * If we've begun active balancing, start to back off. This
         * case may not be covered by the all_pinned logic if there
         * is only 1 task on the busy runqueue (because we don't call
         * detach_tasks).
         */
        if (sd->balance_interval < sd->max_interval)    // balance interval < max_interval:
            sd->balance_interval *= 2;                    // double the interval
    }

    goto out;

out_balanced:
    /*
     * We reach balance although we may have faced some affinity
     * constraints. Clear the imbalance flag only if other tasks got
     * a chance to move and fix the imbalance.
     */
    if (sd_parent && !(env.flags & LBF_ALL_PINNED)) {                    // the sd is below the root domain, and LBF_ALL_PINNED is not set (cpuset pinned only some tasks, or none)
        int *group_imbalance = &sd_parent->groups->sgc->imbalance;

        if (*group_imbalance)            // if the group imbalance flag is set,
            *group_imbalance = 0;        // clear it
    }

out_all_pinned:
    /*
     * We reach balance because all tasks are pinned at this level so
     * we can't migrate them. Let the imbalance flag set so parent level
     * can try to migrate them.
     */
    schedstat_inc(sd->lb_balanced[idle]);        // account the balance

    sd->nr_balance_failed = 0;        // clear the load balance failure count

out_one_pinned:
    /* tune up the balancing interval */                            // widen the balance interval (my guess: back off a while before retrying the balance)
    if (((env.flags & LBF_ALL_PINNED) &&                        // 1. LBF_ALL_PINNED is set and the interval < MAX_PINNED_INTERVAL (512), or
            sd->balance_interval < MAX_PINNED_INTERVAL) ||        
            (sd->balance_interval < sd->max_interval))            // 2. the interval < max_interval:
        sd->balance_interval *= 2;                                // either condition doubles the interval

    ld_moved = 0;            // reset the migrated-task count
out:
    trace_sched_load_balance(this_cpu, idle, *continue_balancing,
                 group ? group->cpumask[0] : 0,
                 busiest ? busiest->nr_running : 0,
                 env.imbalance, env.flags, ld_moved,
                 sd->balance_interval, active_balance,
                 sd_overutilized(sd), env.prefer_spread);
    return ld_moved;
}

(2-1) Decide whether the dst cpu (i.e. the current cpu) may act as the light cpu for this load balance (a light cpu has little work, and pulls tasks from heavier cpus).

 


Here:

group_balance_mask(sg) returns sg->sgc->cpumask (sg being the group of the sd doing the balance). Its meaning depends on the level: at MC level it is just the current cpu (e.g. cpu0 gives 0x1); at DIE level it is all cpu cores of the current cluster (e.g. cpu0-3 of cluster-0 gives 0x0F).

group_balance_cpu_not_isolated(sg) is the first non-isolated cpu in the intersection of sched_group_span(sg) and group_balance_mask(sg) (the two are in fact identical). sched_group_span(sg) returns sg->cpumask, with the same per-level meaning as above: the current cpu at MC level, all cpus of the current cluster at DIE level. This matches the sg->sgc->cpumask described above.

static int should_we_balance(struct lb_env *env)
{
    struct sched_group *sg = env->sd->groups;
    int cpu, balance_cpu = -1;

    /*
     * Ensure the balancing environment is consistent; can happen
     * when the softirq triggers 'during' hotplug.
     */
    if (!cpumask_test_cpu(env->dst_cpu, env->cpus))        // is the target cpu in env->cpus, i.e. (1) inside the env->sd domain and (2) active?
        return 0;

    /*
     * In the newly idle case, we will allow all the CPUs
     * to do the newly idle load balance.
     */
    if (env->idle == CPU_NEWLY_IDLE)        // newly idle balance: any cpu in env->cpus is allowed to do it
        return 1;

    /* Try to find first idle CPU */
    for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {    // find the first idle cpu in the intersection of the sg's balance mask and env->cpus
        if (!idle_cpu(cpu) || cpu_isolated(cpu))
            continue;

        balance_cpu = cpu;        // record the first idle cpu as balance_cpu
        break;
    }

    if (balance_cpu == -1)        // no idle cpu found: take the first non-isolated cpu of the sg instead
        balance_cpu = group_balance_cpu_not_isolated(sg);

    /*
     * First idle CPU or the first CPU(busiest) in this sched group
     * is eligible for doing load balancing at this and above domains.
     */
    return balance_cpu == env->dst_cpu;        // the first idle cpu, or else the first non-isolated cpu, is the one entitled to balance at this and higher domains
}

(2-2) If there is imbalance, find the busiest sched_group on this level's group list of the sched domain, and compute the load that must be migrated to restore balance.

1. Initialize the sds structure.

2. Update the sched-domain statistics used for load balancing — update_sd_lb_stats().

3. With EAS enabled, when a pd (performance domain) exists and the sd is not overutilized (EAS drives task placement until the system goes overutilized, so the load may be deliberately unbalanced to save power), filter out cases that need no load balance:

  • Not newly idle: skip the balance. (Other platforms don't have this filter; it looks odd to me and probably has to be read together with qcom's EAS.)
  • The migration dst cpu is not within env->sd->groups, or no busiest group was found: skip.
  • The busiest group has more than one cpu (i.e. DIE level, inter-cluster migration) and the local cpu's orig capacity is larger than the busiest cpu's: skip (tasks must not be pulled from little cores up to big cores).
  • The local cpu and the busiest cpu have the same orig capacity, or the migration is intra-cluster: if the busiest cpu runs only one task, skip. (That is clearly a single heavy task; moving it to a sibling or an equal-capacity cpu would not improve anything, so the migration is avoided.)

4. Check the ASYM feature, i.e. SD_ASYM_PACKING in sd->flags; the flag only appears at SMT level, which this platform lacks, so it is not examined here.

5. Conditions that force a load balance (force_balance):

  • the busiest group type is group_imbalanced;
  • the dst cpu is idle, the local group has spare capacity, and busiest is overloaded (group_no_capacity == 1);
  • the busiest group type is group_misfit_task, i.e. the group holds a misfit task.

6. Conditions under which no load balance is needed (out_balanced):

  • the target cpu's local group is busier than the busiest group (compared by avg_load; avg_load = total load of all groups in the domain * 1024 / total capacity of all groups in the domain);
  • the local group's avg_load >= the domain's avg_load: don't pull (this local group is certainly not the most idle one; a sibling group is more idle);
  • the target cpu is idle, the busiest group is not overloaded, and the local group's idle cpu count <= the busiest group's idle cpu count + 1;
  • the target cpu is not idle (newly idle or not idle): use imbalance_pct to decide; skip if busiest avg_load * 100 <= env->sd->imbalance_pct * local avg_load. imbalance_pct is fixed when the sd is built (MC: 117, DIE: 125); a worked example follows this list.

7. Depending on the above, either take the force_balance path or return NULL without balancing:

  • force_balance:
    1. save the busiest group type into env->src_grp_type;
    2. check a few more conditions and compute the imbalance value (calculate_imbalance);
    3. return the busiest group if imbalance != 0, else NULL.
  • out_balanced: set env->imbalance = 0 and return NULL.
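To make the imbalance_pct gate concrete, here is a worked example with made-up avg_load values (at DIE level imbalance_pct = 125, so busiest must carry over 25% more avg_load than local before a non-idle cpu pulls from it):

#include <stdio.h>

/* Worked example of the imbalance_pct check; the load values are
 * hypothetical. */
int main(void)
{
    unsigned long imbalance_pct = 125;       /* DIE level */
    unsigned long local_avg_load = 400;      /* made-up avg_load values */
    unsigned long busiest_avg_load = 480;

    /* 480 * 100 = 48000 <= 125 * 400 = 50000: treated as balanced */
    if (100 * busiest_avg_load <= imbalance_pct * local_avg_load)
        printf("balanced enough, skip\n");
    else
        printf("imbalanced, compute imbalance and pull\n");
    return 0;
}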
/**
 * find_busiest_group - Returns the busiest group within the sched_domain
 * if there is an imbalance.
 *
 * Also calculates the amount of weighted load which should be moved
 * to restore balance.
 *
 * @env: The load balancing environment.
 *
 * Return:    - The busiest group if imbalance exists.
 */
static struct sched_group *find_busiest_group(struct lb_env *env)
{
    struct sg_lb_stats *local, *busiest;
    struct sd_lb_stats sds;

    init_sd_lb_stats(&sds);            // initialize the sds structure

    /*
     * Compute the various statistics relavent for load balancing at
     * this level.
     */
    update_sd_lb_stats(env, &sds);                // (2-2-1) update the sched-domain statistics used for load balancing

    if (static_branch_unlikely(&sched_energy_present)) {            // sched_energy_present gates EAS task placement; it is enabled on this platform
        struct root_domain *rd = env->dst_rq->rd;

        if (rcu_dereference(rd->pd) && !sd_overutilized(env->sd)) {                // the root domain's perf domain exists (not fully understood yet), and env->sd is not overutilized
            int cpu_local, cpu_busiest;
            unsigned long capacity_local, capacity_busiest;

            if (env->idle != CPU_NEWLY_IDLE)                // not newly idle: treat as no imbalance; only newly idle balances here??? This odd filter is gone in AOSP kernel-5.10
                goto out_balanced;

            if (!sds.local || !sds.busiest)                    // the target cpu is not within env->sd->groups, or no busiest group was found: no imbalance
                goto out_balanced;

            cpu_local = group_first_cpu(sds.local);            // first cpu of the local group
            cpu_busiest = group_first_cpu(sds.busiest);    // first cpu of the busiest group found

            /* TODO:don't assume same cap cpus are in same domain */
            capacity_local = capacity_orig_of(cpu_local);            // orig capacities of the two cpus
            capacity_busiest = capacity_orig_of(cpu_busiest);
            if ((sds.busiest->group_weight > 1) &&            // busiest group_weight > 1 (group_weight = number of cpus in the sched group; MC level: 1, DIE level: 4),
                capacity_local > capacity_busiest) {        // and local capacity > busiest capacity: no imbalance
                goto out_balanced;                            
            } else if (capacity_local == capacity_busiest ||            // equal capacities,
                   asym_cap_siblings(cpu_local, cpu_busiest)) {            // or the two cpus are asym-capacity siblings:
                if (cpu_rq(cpu_busiest)->nr_running < 2)            // if the busiest group's first cpu runs 0 or 1 tasks,
                    goto out_balanced;                                // there is nothing to migrate
            }
        }
    }

    local = &sds.local_stat;            // grab the stats of the busiest and local groups
    busiest = &sds.busiest_stat;

    /* ASYM feature bypasses nice load balance check */
    if (check_asym_packing(env, &sds))            // the ASYM feature bypasses some checks and can return a result directly
        return sds.busiest;

    /* There is no busy sibling group to pull tasks from */    // no busy sibling group to pull tasks from,
    if (!sds.busiest || busiest->sum_nr_running == 0)        // i.e. no busiest group found, or it has no running task
        goto out_balanced;

    /* XXX broken for overlapping NUMA groups */
    sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)    // sds avg_load = total load of all groups in the domain * 1024 / total capacity of all groups in the domain
                        / sds.total_capacity;

    /*
     * If the busiest group is imbalanced the below checks don't
     * work because they assume all things are equal, which typically
     * isn't true due to cpus_allowed constraints and the like.
     */
    if (busiest->group_type == group_imbalanced)    // a lower-level group under busiest failed to balance because of cpu affinity and set group_imbalanced: force a balance at this level
        goto force_balance;

    /*
     * When dst_cpu is idle, prevent SMP nice and/or asymmetric group
     * capacities from resulting in underutilization due to avg_load.
     */
    if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&        // the target cpu is idle, the local group has plenty of spare capacity, and the busiest group is saturated (no capacity):
        busiest->group_no_capacity)                                             // force a balance
        goto force_balance;

    /* Misfit tasks should be dealt with regardless of the avg load */
    if (busiest->group_type == group_misfit_task)                            // a misfit task exists: force a balance
        goto force_balance;

    /*
     * If the local group is busier than the selected busiest group
     * don't try and pull any tasks.
     */
    if (local->avg_load >= busiest->avg_load)            // the target cpu's local group is busier than the busiest group (by avg_load): don't try to pull
        goto out_balanced;

    /*
     * Don't pull any tasks if this group is already above the domain
     * average load.
     */
    if (local->avg_load >= sds.avg_load)                // the local group's avg_load is already above the domain's avg_load: don't pull (it is certainly not the most idle group)
        goto out_balanced;

    if (env->idle == CPU_IDLE) {
        /*
         * This CPU is idle. If the busiest group is not overloaded
         * and there is no imbalance between this and busiest group
         * wrt idle CPUs, it is balanced. The imbalance becomes
         * significant if the diff is greater than 1 otherwise we
         * might end up to just move the imbalance on another group
         */
        if ((busiest->group_type != group_overloaded) &&
                (local->idle_cpus <= (busiest->idle_cpus + 1)))
            goto out_balanced;
    } else {
        /*
         * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
         * imbalance_pct to be conservative.
         */
        if (100 * busiest->avg_load <=                            // busiest avg_load * 100 <= env->sd->imbalance_pct * local avg_load:
                env->sd->imbalance_pct * local->avg_load)        // skip the balance; otherwise do it
            goto out_balanced;
    }

force_balance:
    /* Looks like there is an imbalance. Compute it */        // reaching force_balance means there is an imbalance, so compute it
    env->src_grp_type = busiest->group_type;            // save the busiest group type
    calculate_imbalance(env, &sds);                            // (2-2-2) compute the amount of imbalance
    trace_sched_load_balance_stats(sds.busiest->cpumask[0],
                busiest->group_type, busiest->avg_load,
                busiest->load_per_task,    sds.local->cpumask[0],
                local->group_type, local->avg_load,
                local->load_per_task,
                sds.avg_load, env->imbalance);
    return env->imbalance ? sds.busiest : NULL;

out_balanced:
    env->imbalance = 0;
    return NULL;
}

(2-2-1) Update the many sched-domain (sds) statistics used for load balancing.

1. A do-while loop walks env->sd->groups once around (for an MC-level sd: every cpu core of the current cluster; for a DIE-level sd: the two clusters). For each group:

  • Check whether the dst cpu lies inside the current sg (sched_group_span(sg) was explained in (2-1); its range differs per level):
    - if so, set local_group and record the local data; the local group is never chosen as the sd's busiest group. (A migration never stays within one group: it is either between cpus of one cluster at MC level, or between clusters at DIE level.)
    - also, if this is not a newly idle balance, or the update interval has expired, refresh the sd's sgc (MC level: the cpu capacity; DIE level: the cluster's sched group capacity), i.e. only the sgc data that load balance cares about.
  • Update this sg's load-balance statistics into the sgs structure.
  • If this is the local group, i.e. the group that receives tasks, skip the busiest-group check for it: the busiest group is the one that must send tasks away because it is too busy.
  • Before the busiest check, two extra fix-ups apply:
    - (never true on this platform) group_no_capacity and group_type would be recomputed depending on whether the child sd prefers siblings, but the condition never holds here;
    - big cores may not shed load to little cores unless they lack the capacity for their current load, i.e. only when all of the following hold are group_no_capacity cleared and group_type recomputed: the sd is at DIE level, the sgs is group_no_capacity, and asym_cap_sibling_group_has_capacity() returns true (in practice that function always returns false on this platform).
  • Check whether the current sg is the busiest group; looping this way converges on the busiest sg and sgs.

2. Update the sd_lb_stats totals used for load balancing:

  • total_running
  • total_load
  • total_capacity
  • total_util

3. Propagate state (mainly overutil up to the parent sd level):

  • save the busiest group's task count into env->src_grp_nr_running;
  • decide whether to publish the overload flag to the root domain;
  • update the overutil flag;
  • if MC level has a misfit task, also mark DIE level overutilized (the misfit state is propagated, in the form of overutil flags, up to whichever parent/grandparent levels span cpus of different capacity);
  • if sd total util * ~105% > sd total capacity, set the parent level's overutil flag as well (a worked example of this check follows).
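A worked example of that last margin check with made-up totals (1078/1024 ≈ 1.053, so the condition fires once util climbs above roughly 95% of capacity):

#include <stdio.h>

/* Worked example of the parent-overutil margin check; all numbers are
 * hypothetical, and 1078 stands in for sched_capacity_margin_up. */
int main(void)
{
    unsigned long total_capacity = 4096;
    unsigned long total_util = 3950;
    unsigned long margin = 1078;

    /* 4096 * 1024 = 4194304 < 3950 * 1078 = 4258100 -> overutilized */
    if (total_capacity * 1024 < total_util * margin)
        printf("mark the parent sd overutilized\n");
    return 0;
}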
/**
 * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
 * @env: The load balancing environment.
 * @sds: variable to hold the statistics for this sched_domain.
 */
static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
    struct sched_domain *child = env->sd->child;
    struct sched_group *sg = env->sd->groups;
    struct sg_lb_stats *local = &sds->local_stat;
    struct sg_lb_stats tmp_sgs;
    bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;  // never true on this platform: only the DIE-level sd has a child, and the MC level lacks SD_PREFER_SIBLING, so this is always false
    int sg_status = 0;

#ifdef CONFIG_NO_HZ_COMMON
    if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))     // newly idle, and nohz.has_blocked is true (blocked load was left incompletely updated at the last PELT update):
        env->flags |= LBF_NOHZ_STATS;                       // set LBF_NOHZ_STATS
#endif

    do {
        struct sg_lb_stats *sgs = &tmp_sgs;
        int local_group;

        local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(sg));                // is the target cpu inside the current sched group?
        if (local_group) {
            sds->local = sg;
            sgs = local;

            if (env->idle != CPU_NEWLY_IDLE ||
                time_after_eq(jiffies, sg->sgc->next_update))            // not a newly idle case, or the sgc (sched group capacity) update interval expired: refresh the group capacity
                update_group_capacity(env->sd, env->dst_cpu);       // (2-2-1-1) update the sched group capacity (sgc) data   
        }

        update_sg_lb_stats(env, sg, sgs, &sg_status);                    // (2-2-1-2) update the sched-group statistics used for load balancing

        if (local_group)            // dst_cpu lies in this group (the local group): no need to test it for busiest
            goto next_group;

        /*
         * In case the child domain prefers tasks go to siblings
         * first, lower the sg capacity so that we'll try
         * and move all the excess tasks away. We lower the capacity
         * of a group only if the local group has the capacity to fit
         * these excess tasks. The extra check prevents the case where
         * you always pull from the heaviest group when it is already
         * under-utilized (possible with a large weight task outweighs
         * the tasks on the system).
         */
        if (prefer_sibling && sds->local &&                        // prefer_sibling is always false here, so this branch never runs
            group_has_capacity(env, local) &&                     // (group_has_capacity is the inverse of the group_no_capacity logic)
            (sgs->sum_nr_running > local->sum_nr_running + 1)) {
            sgs->group_no_capacity = 1;
            sgs->group_type = group_classify(sg, sgs);        // reclassify the group: overloaded, imbalanced, misfit_task
        }

        /*
         * Disallow moving tasks from asym cap sibling CPUs to other
         * CPUs (lower capacity) unless the asym cap sibling group has
         * no capacity to manage the current load.
         */
        if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
            sgs->group_no_capacity &&
            asym_cap_sibling_group_has_capacity(env->dst_cpu,     // (2-2-1-3) on this platform this condition always returns false, so it never holds
                        env->sd->imbalance_pct)) {
            sgs->group_no_capacity = 0;                  // update group_no_capacity
            sgs->group_type = group_classify(sg, sgs);         // update group_type
        }

        if (update_sd_pick_busiest(env, sds, sg, sgs)) {            // (2-2-1-4) is this sg the busiest group? keep looping to converge on the busiest
            sds->busiest = sg;
            sds->busiest_stat = *sgs;
        }

next_group:
        /* Now, start updating sd_lb_stats */            // accumulate the domain-wide stats used for load balancing
        sds->total_running += sgs->sum_nr_running;
        sds->total_load += sgs->group_load;
        sds->total_capacity += sgs->group_capacity;
        sds->total_util += sgs->group_util;

        trace_sched_load_balance_sg_stats(sg->cpumask[0],
                sgs->group_type, sgs->idle_cpus,
                sgs->sum_nr_running, sgs->group_load,
                sgs->group_capacity, sgs->group_util,
                sgs->group_no_capacity,    sgs->load_per_task,
                sgs->group_misfit_task_load,
                sds->busiest ? sds->busiest->cpumask[0] : 0);

        sg = sg->next;                            // move to the next group
    } while (sg != env->sd->groups);  // until every group in the domain has been visited once

#ifdef CONFIG_NO_HZ_COMMON
    if ((env->flags & LBF_NOHZ_AGAIN) &&                                                // is LBF_NOHZ_AGAIN set? i.e. LBF_NOHZ_STATS was set above but update_sg_lb_stats could not clear all the blocked load
        cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {                // and all nohz idle cpus lie within env->sd's span:

        WRITE_ONCE(nohz.next_blocked,
               jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));            // refresh the next_blocked timestamp, presumably so the leftover blocked load gets updated
    }
#endif

    if (env->sd->flags & SD_NUMA)                                        // NUMA balance; this platform is not NUMA, so not considered here
        env->fbq_type = fbq_classify_group(&sds->busiest_stat);

    env->src_grp_nr_running = sds->busiest_stat.sum_nr_running;        // save the busiest group's sum_nr_running task count

    if (!env->sd->parent) {                                        // env->sd sits at the root domain (i.e. env->sd is the DIE level)
        struct root_domain *rd = env->dst_rq->rd;

        /* update overload indicator if we are at root domain */
        WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);        // publish the overload flag to the root domain
    }

    if (sg_status & SG_OVERUTILIZED)            // per the current status, update the sd's overutil flag
        set_sd_overutilized(env->sd);        // sd->shared->overutilized = true
    else
        clear_sd_overutilized(env->sd);       // sd->shared->overutilized = false

    /*
     * If there is a misfit task in one cpu in this sched_domain
     * it is likely that the imbalance cannot be sorted out among
     * the cpu's in this sched_domain. In this case set the
     * overutilized flag at the parent sched_domain.
     */
    if (sg_status & SG_HAS_MISFIT_TASK) {                        
        struct sched_domain *sd = env->sd->parent;

        /*
         * In case of a misfit task, load balance at the parent
         * sched domain level will make sense only if the cpus
         * have a different capacity. If cpus at a domain level have
         * the same capacity, the misfit task cannot be well
         * accommodated in any of the cpus and there is no point in
         * trying a load balance at this level
         */
        while (sd) {                                // walk up the domain levels
            if (sd->flags & SD_ASYM_CPUCAPACITY) {    // find a domain spanning cpus of different capacity
                set_sd_overutilized(sd);            // and set its overutil flag: sd->shared->overutilized = true
                break;
            }
            sd = sd->parent;
        }
    }

    /*
     * If the domain util is greater that domain capacity, load balancing
     * needs to be done at the next sched domain level as well.
     */
    if (env->sd->parent &&
        sds->total_capacity * 1024 < sds->total_util *                        // if sds total_capacity * 1024 < sds total_util * 1078 (~105%),
             sched_capacity_margin_up[group_first_cpu(sds->local)])
        set_sd_overutilized(env->sd->parent);                                // the next level must balance too: set the parent sd's overutil flag

}

(2-2-1-1) Update the sched group capacity (sgc) data.

1. If the sd has no child, it is the lowest-level sd (MC level), and only the cpu capacity needs updating: the group holds a single cpu, so the group capacity is that cpu's own capacity, which is determined by the cpu's max frequency and clipped by thermal limits.

2. This platform is big.LITTLE with 8 cores (4+4), so group capacity works out roughly as follows (re SD_OVERLAP: whether sched domains overlap; on this platform they do not). A small aggregation sketch follows below.

MC level: the lowest sd, so sgc->capacity is simply each cpu's own capacity.

The detailed MC-level group capacity computation follows the flow in the sched domain/group construction post, which calls the same function, so it is not repeated here: CPU拓扑结构和调度域/组--更新对应group的capacity.

DIE level: sgc->capacity is the sum of the child groups' sgc->capacity, i.e. effectively the capacity of all cpus in the cluster.

Finally, each level's group capacity is recorded, along with the minimum and maximum capacities.
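As an illustration of the DIE-level aggregation, here is a standalone sketch with made-up per-cpu capacities for a 4-cpu little cluster:

#include <stdio.h>

/* Sketch of DIE-level group capacity aggregation; the per-cpu
 * capacities are hypothetical. */
int main(void)
{
    unsigned long cpu_cap[4] = { 380, 380, 375, 372 };
    unsigned long capacity = 0, min_cap = (unsigned long)-1, max_cap = 0;

    for (int i = 0; i < 4; i++) {           /* one MC-level sgc per cpu */
        capacity += cpu_cap[i];
        if (cpu_cap[i] < min_cap) min_cap = cpu_cap[i];
        if (cpu_cap[i] > max_cap) max_cap = cpu_cap[i];
    }
    printf("DIE sgc: capacity=%lu min=%lu max=%lu\n",
           capacity, min_cap, max_cap);
    return 0;
}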

void update_group_capacity(struct sched_domain *sd, int cpu)
{
    struct sched_domain *child = sd->child;
    struct sched_group *group, *sdg = sd->groups;
    unsigned long capacity, min_capacity, max_capacity;
    unsigned long interval;

    interval = msecs_to_jiffies(sd->balance_interval);                // sgc updates are rate-limited: 1 ~ HZ/10
    interval = clamp(interval, 1UL, max_load_balance_interval);
    sdg->sgc->next_update = jiffies + interval;

    if (!child) {                        // an MC-level sd only needs the cpu capacity updated, since its sg holds a single cpu
        update_cpu_capacity(sd, cpu);    // (2-2-1-1-1) update the cpu capacity
        return;
    }

    capacity = 0;
    min_capacity = ULONG_MAX;
    max_capacity = 0;

    if (child->flags & SD_OVERLAP) {            // the overlapping-sd case; this platform has no sd overlap
        /*
         * SD_OVERLAP domains cannot assume that child groups
         * span the current group.
         */

        for_each_cpu(cpu, sched_group_span(sdg)) {
            struct sched_group_capacity *sgc;
            struct rq *rq = cpu_rq(cpu);

            if (cpu_isolated(cpu))
                continue;

            /*
             * build_sched_domains() -> init_sched_groups_capacity()
             * gets here before we've attached the domains to the
             * runqueues.
             *
             * Use capacity_of(), which is set irrespective of domains
             * in update_cpu_capacity().
             *
             * This avoids capacity from being 0 and
             * causing divide-by-zero issues on boot.
             */
            if (unlikely(!rq->sd)) {
                capacity += capacity_of(cpu);
            } else {
                sgc = rq->sd->groups->sgc;
                capacity += sgc->capacity;
            }

            min_capacity = min(capacity, min_capacity);
            max_capacity = max(capacity, max_capacity);
        }
    } else  {
        /*
         * !SD_OVERLAP domains can assume that child groups        // no sd overlap: all the child sd's groups together span the current group
         * span the current group.
         */

        group = child->groups;
        do {                                                    // do-while over the child sd's circular sg list; on this platform we are at DIE level here, so the child sd is the MC-level groups
            struct sched_group_capacity *sgc = group->sgc;        // grab the matching sgc
            __maybe_unused cpumask_t *cpus =
                    sched_group_span(group);                    // the group sits at MC level, so the span is just the sg's cpu

            if (!cpu_isolated(cpumask_first(cpus))) {        // skip isolated cpus
                capacity += sgc->capacity;                    // accumulate each sgc's (cpu's) capacity
                min_capacity = min(sgc->min_capacity,        // track the smallest sgc->capacity
                            min_capacity);
                max_capacity = max(sgc->max_capacity,        // track the largest sgc->capacity
                            max_capacity);
            }
            group = group->next;
        } while (group != child->groups);
    }

    sdg->sgc->capacity = capacity;                // the sum of the MC-level sgc->capacity values becomes the DIE-level group's capacity
    sdg->sgc->min_capacity = min_capacity;        // and the min/max capacities are recorded
    sdg->sgc->max_capacity = max_capacity;
}

At MC level, only the cpu capacity needs updating:

static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
    unsigned long capacity = arch_scale_cpu_capacity(cpu);    // read the per-cpu variable cpu_scale
    struct sched_group *sdg = sd->groups;

    capacity *= arch_scale_max_freq_capacity(sd, cpu);        // factor in the per-cpu variable max_freq_scale
    capacity >>= SCHED_CAPACITY_SHIFT;                        // these two steps compute: cpu_scale * max_freq_scale / 1024

    capacity = min(capacity, thermal_cap(cpu));                // the result must not exceed the thermal limit on the cpu's capacity
    cpu_rq(cpu)->cpu_capacity_orig = capacity;                // store it as this cpu rq's cpu_capacity_orig

    capacity = scale_rt_capacity(cpu, capacity);        // (2-2-1-1-1-1) compute the cpu capacity left over for the cfs rq

    if (!capacity)            // if nothing is left for cfs, force it to 1
        capacity = 1;

    cpu_rq(cpu)->cpu_capacity = capacity;        // update the related sgc capacities: the rq's cpu_capacity, and the sgc's capacity and min/max
    sdg->sgc->capacity = capacity;
    sdg->sgc->min_capacity = capacity;
    sdg->sgc->max_capacity = capacity;
}
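A worked example of the update_cpu_capacity() arithmetic with made-up numbers (cpu_scale = 1024 for a big core, max_freq_scale = 800 for a cpu capped below fmax, thermal cap = 760):

#include <stdio.h>

/* Worked example of the capacity arithmetic; all values are
 * hypothetical. */
int main(void)
{
    unsigned long capacity = 1024;        /* pretend arch_scale_cpu_capacity() */
    capacity *= 800;                      /* pretend arch_scale_max_freq_capacity() */
    capacity >>= 10;                      /* / 1024 -> 800 */

    unsigned long thermal = 760;          /* pretend thermal_cap() */
    if (capacity > thermal)
        capacity = thermal;               /* clamped to 760 */

    printf("cpu_capacity_orig = %lu\n", capacity);
    return 0;
}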

(2-2-1-2)更新用于load balance的调度组(sg)相关数据

  1. 遍历在env->cpus中,且在当前sg范围的每个cpu,但忽略isolate的cpu。获取cpu对应的负载均线(cpu负载相关分析,见:)。不同 level下,对应获取cpu负载均线会有差别,具体信息(包括flags、imbalance_pct等)如下: 

   

spark 负载均衡开关 负载均衡pool_ci_05

   

spark 负载均衡开关 负载均衡pool_ci_06

  1. 首先,根据balance的类别:idle、newly idle、busy,来使用对应特定的cpu负载均线。再次,根据是否为local group,还会有不同的bias:local group【负载均线值和实时值,取其较】,非local group【负载均线值和实时值,取其较】。将最后的结果作为该cpu的balance load。 需要注意的是对应的cpu_load需要将idx减1,例如busy balance情况下,busy_idx = 2,而均线对应的是cpu_load[1],即(old_load + new_load)/ 2 的那条均线。(注意:target_load() 和 source_load() 两个函数的实现)
  • cpu load这部分,我看了kernel-5.4以后就不再使用这种bias了,而是直接使用cpu_load[0]的数据。
  3. Collect the sched-group statistics used for load balancing (the fields of the sgs struct), including:
  1. group_load: the per-cpu balance loads from above, summed over the group
  2. group_util: the group's total cpu_util, i.e. the cpu utilization tracked by WALT or PELT
  3. sum_nr_running: the total number of runnable tasks on all cfs rqs in the group
  4. sg_status: whether the group is currently in a special state: overload, overutil, misfit
  5. sum_weighted_load: the sum of the instantaneous loads of all cpus in the group
  6. idle_cpus: the number of idle cpus in the group
  7. group_misfit_task_load: at DIE level, if overload shows up, record the largest misfit_task_load among all rqs
  8. group_capacity: simply group->sgc->capacity; at MC level each cpu's own capacity, at DIE level the sum of the cpu capacities within the cluster
  9. avg_load: computed as group_load * 1024 / group_capacity
  10. group_weight: simply group->group_weight; 1 at MC level, 4 at DIE level
  11. group_no_capacity: whether the group has any spare capacity left; effectively a record of overload, since an overloaded group by definition has none. The overload condition: group_capacity * 100 < group_util * imbalance_pct (the sd's imbalance threshold; MC: 117, DIE: 125)
  12. group_type: classifies the group's state from the fields above. The main types:
  • group_other (0, normal),
  • group_misfit_task (1, a misfit task is present),
  • group_imbalanced (2, imbalanced, group->sgc->imbalance is set),
  • group_overloaded (3, overloaded, i.e. group_no_capacity is true).

        This is somewhat like sg_status, except group_type adds "group_imbalanced" and drops overutil.

  • load_per_task: computed as sum_weighted_load / sum_nr_running. A small worked example of these derived fields follows.
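As a sanity check on the derived fields, here is a minimal sketch (made-up numbers, not from a real trace) computing avg_load, load_per_task and the spare-capacity test for one group at DIE level (imbalance_pct = 125); it folds group_is_overloaded() down to its core condition and leaves out the WALT rotation special case:

#include <stdio.h>

int main(void)
{
    /* hypothetical snapshot of one 4-cpu cluster at DIE level */
    unsigned long group_load = 2600;        /* biased per-cpu loads, summed */
    unsigned long sum_weighted_load = 2400; /* instantaneous loads, summed */
    unsigned long group_util = 1900;
    unsigned long group_capacity = 1600;
    unsigned int  sum_nr_running = 6;
    unsigned int  group_weight = 4;
    unsigned int  imbalance_pct = 125;      /* DIE-level threshold */

    unsigned long avg_load = group_load * 1024 / group_capacity;
    unsigned long load_per_task = sum_weighted_load / sum_nr_running;

    /* group_is_overloaded(), minus the WALT rotation branch */
    int no_capacity = (sum_nr_running > group_weight) &&
                      (group_capacity * 100 < group_util * imbalance_pct);

    printf("avg_load=%lu load_per_task=%lu group_no_capacity=%d\n",
           avg_load, load_per_task, no_capacity); /* 1664 400 1 */
    return 0;
}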

Driven by the loop in (2-2-1), this function runs two rounds: the first updates each per-cpu sg within a cluster; the second updates the two per-cluster sgs.

/**
 * update_sg_lb_stats - Update sched_group's statistics for load balancing.
 * @env: The load balancing environment.
 * @group: sched_group whose statistics are to be updated.
 * @sgs: variable to hold the statistics for this group.
 * @sg_status: Holds flag indicating the status of the sched_group
 */
static inline void update_sg_lb_stats(struct lb_env *env,
                      struct sched_group *group,
                      struct sg_lb_stats *sgs,
                      int *sg_status)
{
    int local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));        //flag local_group: is the migration's destination cpu inside this group?
    int load_idx = get_sd_load_idx(env->sd, env->idle);        //(2-2-1-2-1)pick which set of cpu load data to use; there are three: idle, newly idle, busy
    unsigned long load;
    int i, nr_running;

    memset(sgs, 0, sizeof(*sgs));

    for_each_cpu_and(i, sched_group_span(group), env->cpus) {                //iterate every cpu that is inside the group's span and also in env->cpus (load_balance_mask)
        struct rq *rq = cpu_rq(i);

        if (cpu_isolated(i))        //skip isolated cpus
            continue;

        if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))  //leftover blocked load may not have been folded into pelt yet, so try to update it here via update_nohz_stats
            env->flags |= LBF_NOHZ_AGAIN;        //if the blocked load still isn't fully folded in, set the flag so a later pass finishes the remainder

        /* Bias balancing toward CPUs of our domain: */
        if (local_group)
            load = target_load(i, load_idx);        //cpu[i]'s load, local-group case: with a non-zero idx, take the LARGER of the load line and the instantaneous load
        else
            load = source_load(i, load_idx);        //cpu[i]'s load, non-local case: with a non-zero idx, take the SMALLER of the load line and the instantaneous load

        sgs->group_load += load;            //total (biased) load of all cpus in the group
        sgs->group_util += cpu_util(i);        //total util of the group
        sgs->sum_nr_running += rq->cfs.h_nr_running;    //total runnable tasks across the group's cpu rqs

        nr_running = rq->nr_running;        //runnable tasks at this rq level only (excludes child se's tasks); h_nr_running, by contrast, counts all child se's runnable tasks too
        if (nr_running > 1)
            *sg_status |= SG_OVERLOAD;        //mark that this cpu has more than 1 runnable task

        if (cpu_overutilized(i)) {
            *sg_status |= SG_OVERUTILIZED;    //mark this cpu as overutilized

            if (rq->misfit_task_load)
                *sg_status |= SG_HAS_MISFIT_TASK;    //mark whether the rq holds a misfit task
        }

#ifdef CONFIG_NUMA_BALANCING
        sgs->nr_numa_running += rq->nr_numa_running;
        sgs->nr_preferred_running += rq->nr_preferred_running;
#endif
        sgs->sum_weighted_load += weighted_cpuload(rq);    //total instantaneous load of the group: sum each cpu rq's cfs_rq->avg.runnable_load_avg
        /*
         * No need to call idle_cpu() if nr_running is not 0
         */
        if (!nr_running && idle_cpu(i))
            sgs->idle_cpus++;                    //count idle cpus

        if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
            sgs->group_misfit_task_load < rq->misfit_task_load) {
            sgs->group_misfit_task_load = rq->misfit_task_load;
            *sg_status |= SG_OVERLOAD;            //detect overload (DIE level) and keep the largest rq misfit_task_load as group_misfit_task_load
        }
    }

    /* Isolated CPU has no weight */
    if (!group->group_weight) {
        sgs->group_capacity = 0;        //parameter defaults for an isolated cpu
        sgs->avg_load = 0;
        sgs->group_no_capacity = 1;
        sgs->group_type = group_other;
        sgs->group_weight = group->group_weight;
    } else {
        /* Adjust by relative CPU capacity of the group */
        sgs->group_capacity = group->sgc->capacity;                    //computed values for active cpus
        sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) /
                            sgs->group_capacity;

        sgs->group_weight = group->group_weight;

        sgs->group_no_capacity = group_is_overloaded(env, sgs);    //is the group overloaded?
        sgs->group_type = group_classify(group, sgs);                //derive the group type
    }

    if (sgs->sum_nr_running)
        sgs->load_per_task = sgs->sum_weighted_load /
                        sgs->sum_nr_running;
}

(2-2-1-2-1) Pick which set of cpu load data to use for load balancing. Based on the current cpu state, the three indexes are idle balance (idle_idx), newly idle balance (newidle_idx) and busy balance (busy_idx).

From the per-level table referenced earlier:

busy_idx: 2

newidle_idx: 0

idle_idx: differs by level; MC: 0, DIE: 1

/**
 * get_sd_load_idx - Obtain the load index for a given sched domain.
 * @sd: The sched_domain whose load_idx is to be obtained.
 * @idle: The idle status of the CPU for whose sd load_idx is obtained.
 *
 * Return: The load index.
 */
static inline int get_sd_load_idx(struct sched_domain *sd,
                    enum cpu_idle_type idle)
{
    int load_idx;

    switch (idle) {
    case CPU_NOT_IDLE:
        load_idx = sd->busy_idx;
        break;

    case CPU_NEWLY_IDLE:
        load_idx = sd->newidle_idx;
        break;
    default:
        load_idx = sd->idle_idx;
        break;
    }

    return load_idx;
}
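For reference, the cpu_load[] lines these indexes select are progressively longer-horizon decaying averages. A minimal sketch of the classic update rule, modeled on the pre-5.4 cpu_load_update() (simplified: the kernel also rounds the new sample up when it is rising, which this sketch omits), shows why cpu_load[1] is exactly the (old + new) / 2 line:

#include <stdio.h>

/* per-index decaying update: cpu_load[i] = (old * (2^i - 1) + new) >> i,
 * so cpu_load[0] is the instantaneous load and cpu_load[1] is (old+new)/2,
 * the line used when busy_idx - 1 == 1 */
int main(void)
{
    unsigned long cpu_load[5] = {0};
    unsigned long samples[] = {1024, 512, 768, 256}; /* made-up tick loads */

    for (int s = 0; s < 4; s++) {
        for (int i = 0; i < 5; i++) {
            unsigned long scale = 1UL << i;
            cpu_load[i] = (cpu_load[i] * (scale - 1) + samples[s]) >> i;
        }
    }
    for (int i = 0; i < 5; i++)  /* higher i reacts more slowly */
        printf("cpu_load[%d] = %lu\n", i, cpu_load[i]);
    return 0;
}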

The target_load() and source_load() implementations:

/*
 * Return a low guess at the load of a migration-source CPU weighted
 * according to the scheduling class and "nice" value.
 *
 * We want to under-estimate the load of migration sources, to
 * balance conservatively.
 */
static unsigned long source_load(int cpu, int type)
{
    struct rq *rq = cpu_rq(cpu);
    unsigned long total = weighted_cpuload(rq);  //total:cfs_rq->avg.runnable_load_avg

    if (type == 0 || !sched_feat(LB_BIAS))
        return total;

    return min(rq->cpu_load[type-1], total);
}

/*
 * Return a high guess at the load of a migration-target CPU weighted
 * according to the scheduling class and "nice" value.
 */
static unsigned long target_load(int cpu, int type)
{
    struct rq *rq = cpu_rq(cpu);
    unsigned long total = weighted_cpuload(rq);  //total:cfs_rq->avg.runnable_load_avg

    if (type == 0 || !sched_feat(LB_BIAS))
        return total;

    return max(rq->cpu_load[type-1], total);
}

cpu_util: note that WALT and PELT end up feeding different data into it.

static inline unsigned long cpu_util(int cpu)
{
    struct cfs_rq *cfs_rq;
    unsigned int util;

#ifdef CONFIG_SCHED_WALT
    u64 walt_cpu_util =
        cpu_rq(cpu)->wrq.walt_stats.cumulative_runnable_avg_scaled;

    return min_t(unsigned long, walt_cpu_util, capacity_orig_of(cpu));
#endif

    cfs_rq = &cpu_rq(cpu)->cfs;
    util = READ_ONCE(cfs_rq->avg.util_avg);

    if (sched_feat(UTIL_EST))
        util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));

    return min_t(unsigned long, util, capacity_orig_of(cpu));
}

The group_no_capacity decision:

  • If the task count is no greater than the number of cpus in the group, it is definitely not overloaded
  • Otherwise, under WALT accounting, if this is an idle or newly idle balance and walt rotation is enabled, it counts as overloaded
  • Otherwise check: group_capacity * 100 < group_util * the current sd's imbalance_pct; if that holds, it is overloaded
  • If none of the above apply, it is not overloaded
/*
 *  group_is_overloaded returns true if the group has more tasks than it can
 *  handle.
 *  group_is_overloaded is not equals to !group_has_capacity because a group
 *  with the exact right number of tasks, has no more spare capacity but is not
 *  overloaded so both group_has_capacity and group_is_overloaded return
 *  false.
 */
static inline bool
group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
{
    if (sgs->sum_nr_running <= sgs->group_weight)
        return false;

#ifdef CONFIG_SCHED_WALT
    if (env->idle != CPU_NOT_IDLE && walt_rotation_enabled)
        return true;
#endif

    if ((sgs->group_capacity * 100) <
            (sgs->group_util * env->sd->imbalance_pct))
        return true;

    return false;
}

The group_type classification:

static inline enum
group_type group_classify(struct sched_group *group,
              struct sg_lb_stats *sgs)
{
    if (sgs->group_no_capacity)        //group_no_capacity == 1 means the group is overloaded
        return group_overloaded;

    if (sg_imbalanced(group))        //checks group->sgc->imbalance; if set, the group is imbalanced (as far as I can tell it gets set from load_balance() via sd_parent when pinned tasks block a balance, i.e. the LBF_SOME_PINNED path — I had not located the assignment at first)
        return group_imbalanced;

    if (sgs->group_misfit_task_load)    //a non-zero misfit task load means the group carries a misfit task
        return group_misfit_task;

    return group_other;        //none of the above: the group is in a normal state
}

(2-2-1-3) asym_cap_sibling_cpus is only assigned in WALT code, and on this platform's topologies (4+4, 6+2, even 4+3+1) it never receives a valid value and stays empty, so this function always returns false:

static inline bool asym_cap_sibling_group_has_capacity(int dst_cpu, int margin)
{
    int sib1, sib2;
    int nr_running;
    unsigned long total_util, total_capacity;

    if (cpumask_empty(&asym_cap_sibling_cpus) ||
            cpumask_test_cpu(dst_cpu, &asym_cap_sibling_cpus))
        return false;

    sib1 = cpumask_first(&asym_cap_sibling_cpus);
    sib2 = cpumask_last(&asym_cap_sibling_cpus);

    if (!cpu_active(sib1) || cpu_isolated(sib1) ||
        !cpu_active(sib2) || cpu_isolated(sib2))
        return false;

    nr_running = cpu_rq(sib1)->cfs.h_nr_running +
            cpu_rq(sib2)->cfs.h_nr_running;

    if (nr_running <= 2)
        return true;

    total_capacity = capacity_of(sib1) + capacity_of(sib2);
    total_util = cpu_util(sib1) + cpu_util(sib2);

    return ((total_capacity * 100) > (total_util * margin));
}

(2-2-1-4) Decide whether the current sg is the busiest group; the busiest group emerges by comparing candidates round after round in the loop.

/**
 * update_sd_pick_busiest - return 1 on busiest group
 * @env: The load balancing environment.
 * @sds: sched_domain statistics
 * @sg: sched_group candidate to be checked for being the busiest
 * @sgs: sched_group statistics
 *
 * Determine if @sg is a busier group than the previously selected
 * busiest group.
 *
 * Return: %true if @sg is a busier group than the previously selected
 * busiest group. %false otherwise.
 */
static bool update_sd_pick_busiest(struct lb_env *env,
                   struct sd_lb_stats *sds,
                   struct sched_group *sg,
                   struct sg_lb_stats *sgs)
{
    struct sg_lb_stats *busiest = &sds->busiest_stat;

    /*
     * Don't try to pull misfit tasks we can't help.                       1. don't try to pull misfit tasks we can't actually help
     * We can use max_capacity here as reduction in capacity on some       2. max_capacity tells us whether any cpu in the group can ultimately resolve the avg_load imbalance
     * CPUs in the group should either be possible to resolve
     * internally or be covered by avg_load imbalance (eventually).
     */
    if (sgs->group_type == group_misfit_task &&                    //the group carries a misfit task, and
        (!group_smaller_max_cpu_capacity(sg, sds->local) ||        //this sg's max_capacity * margin_up >= local sg's max_capacity * 1024 (margin_up is 1078, roughly a 5% bump),
             !group_has_capacity(env, &sds->local_stat)))            //or the local group has no spare capacity
        return false;
                                                        //classify by group_type
    if (sgs->group_type > busiest->group_type)            //only a larger group_type is busier
        return true;

    if (sgs->group_type < busiest->group_type)            //a smaller group_type than the current busiest can never be busiest
        return false;

    /*
     * This sg and busiest are classified as same. when prefer_spread    //the group types tie, so look closer:
     * is true, we want to maximize the chance of pulling tasks, so        //with prefer_spread set, we maximize the chance of pulling tasks,
     * prefer to pick sg with more runnable tasks and break the ties    //preferring the sg with more runnable tasks and breaking ties
     * with utilization.                                                //with group_util
     */
    if (env->prefer_spread) {
        if (sgs->sum_nr_running < busiest->sum_nr_running)
            return false;
        if (sgs->sum_nr_running > busiest->sum_nr_running)
            return true;
        return sgs->group_util > busiest->group_util;
    }

    if (sgs->avg_load <= busiest->avg_load)        //when nothing above separates them, compare avg_load; anything <= the current busiest group cannot be busiest
        return false;

    if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))    //at MC level (only DIE level carries the asym-cpucapacity flag),
        goto asym_packing;                            //jump ahead

    /*
     * Candidate sg has no more than one task per CPU and        //the candidate sg has at most one task per cpu
     * has higher per-CPU capacity. Migrating tasks to less        //and higher per-cpu capacity than the local group;
     * capable CPUs may harm throughput. Maximize throughput,        //migrating tasks to less capable cpus may hurt throughput,
     * power/energy consequences are not considered.                //so maximize throughput and ignore power/energy consequences here
     */
    if (sgs->sum_nr_running <= sgs->group_weight &&            //group_weight is simply the cpu count in the sg
        group_smaller_min_cpu_capacity(sds->local, sg))        //true when: local sg->min_capacity * margin_up < this sg->min_capacity * 1024, margin_up = 1078 (~5%)
        return false;                                        //both hold: this sg is not the busiest

    /*
     * Candidate sg doesn't face any severe imbalance issues so        //a candidate sg without severe imbalance shouldn't be disturbed,
     * don't disturb unless the groups are of similar capacity        //unless the groups are of similar capacity,
     * where balancing is more harmless.                            //where balancing does less harm than staying imbalanced
     */
    if (sgs->group_type == group_other &&                //this sg is an ordinary one (no severe imbalance), and
        !group_similar_cpu_capacity(sds->local, sg))    //the two groups' min_capacity differ by more than ~12.5%
        return false;                                    //both hold: this sg is not the busiest

    /*
     * If we have more than one misfit sg go with the biggest misfit.    //with several misfit sgs, pick the one with the worst misfit, filtering out the milder ones
     */
    if (sgs->group_type == group_misfit_task &&                                //this sg carries a misfit task:
        sgs->group_misfit_task_load < busiest->group_misfit_task_load)        //compare group_misfit_task_load; only a larger value wins busiest
        return false;

asym_packing:

    /* This is the busiest node in its class. */
    if (!(env->sd->flags & SD_ASYM_PACKING))            //this platform never sets the flag, so we always return true here
        return true;
                                                            //the flow below is not analyzed for now
    /* No ASYM_PACKING if target CPU is already busy */
    if (env->idle == CPU_NOT_IDLE)
        return true;
    /*
     * ASYM_PACKING needs to move all the work to the highest
     * prority CPUs in the group, therefore mark all groups
     * of lower priority than ourself as busy.
     */
    if (sgs->sum_nr_running &&
        sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
        if (!sds->busiest)
            return true;

        /* Prefer to move from lowest priority CPU's work */
        if (sched_asym_prefer(sds->busiest->asym_prefer_cpu,
                      sg->asym_prefer_cpu))
            return true;
    }

    return false;
}
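The group_smaller_{min,max}_cpu_capacity() helpers used above all reduce to the same fixed-point margin test. A minimal sketch (the 1078 margin value is taken from the notes above, i.e. roughly a 5% band; inputs are invented):

#include <stdio.h>

/* returns 1 when 'a' is more than ~5% smaller than 'b':
 * a * 1078 < b * 1024  <=>  a < b * (1024/1078) ~= 0.95 * b */
static int smaller_with_margin(unsigned long a, unsigned long b)
{
    return a * 1078 < b * 1024;
}

int main(void)
{
    printf("%d\n", smaller_with_margin(740, 1024));  /* little vs big: 1 */
    printf("%d\n", smaller_with_margin(1000, 1024)); /* within 5%:    0 */
    return 0;
}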

(2-2-2) Compute the imbalance value

/**
 * calculate_imbalance - Calculate the amount of imbalance present within the
 *             groups of a given sched_domain during load balance.
 * @env: load balance environment
 * @sds: statistics of the sched_domain whose imbalance is to be calculated.
 */
static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
{
    unsigned long max_pull, load_above_capacity = ~0UL;
    struct sg_lb_stats *local, *busiest;
    bool no_imbalance = false;

    local = &sds->local_stat;
    busiest = &sds->busiest_stat;

    if (busiest->group_type == group_imbalanced) {                        //the busiest group is flagged imbalanced:
        /*
         * In the group_imb case we cannot rely on group-wide averages    //we can no longer rely on group-level avg load to restore cpu-load equilibrium,
         * to ensure CPU-load equilibrium, look at wider averages. XXX    //so look at the wider average, sds->avg_load
         */
        busiest->load_per_task =
            min(busiest->load_per_task, sds->avg_load);                    //use the smaller of busiest->load_per_task and sds->avg_load as busiest's load_per_task
    }

    /*
     * Avg load of busiest sg can be less and avg load of local sg can    //the sds/busiest/local avg loads are divided by different group capacities,
     * be greater than avg load across all sgs of sd because avg load    //and sgs with a smaller group_type were skipped while picking busiest,
     * factors in sg capacity and sgs with smaller group_type are        //so busiest's avg load can fall below (and local's rise above) the sd-wide average
     * skipped when updating the busiest sg.
     */
    if (busiest->avg_load <= sds->avg_load ||                            //busiest avg load <= sds avg load,
         local->avg_load >= sds->avg_load)                                //or local avg load >= sds avg load:
        no_imbalance = true;                                            //set no_imbalance = true and decide further below

    if (busiest->group_type != group_misfit_task && no_imbalance) {        //busiest carries no misfit task and no_imbalance == true:
        env->imbalance = 0;                                                //first force the imbalance to 0
        if (busiest->group_type == group_overloaded &&                    //then, if busiest is overloaded while local's group type <= group_misfit_task (local has little or no load trouble),
                local->group_type <= group_misfit_task) {
            env->imbalance = busiest->load_per_task;                    //set the imbalance to busiest's load_per_task
            return;                                                        //and return right away
        }
        return fix_small_imbalance(env, sds);                            //(2-2-2-1)otherwise fall back to computing a small imbalance
    }

    /*
     * If there aren't any idle CPUs, avoid creating some.    //with no idle cpus anywhere, avoid creating any through migration
     */
    if (busiest->group_type == group_overloaded &&                                //when both busiest and local are overloaded:
        local->group_type   == group_overloaded) {
        load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;    //start with load_above_capacity = busiest task count * 1024
        if (load_above_capacity > busiest->group_capacity) {        //if that exceeds busiest's group capacity,
            load_above_capacity -= busiest->group_capacity;
            load_above_capacity *= scale_load_down(NICE_0_LOAD);    //compute (task count * 1024 - group capacity) * 1024 / group capacity,
            load_above_capacity /= busiest->group_capacity;            //i.e. the load portion above busiest's capacity; with a single task this comes out near 0
        } else
            load_above_capacity = ~0UL;                                //otherwise take load_above_capacity out of consideration
    }

    /*
     * In case of a misfit task, independent of avg loads we do load balance    //with a misfit task, big.LITTLE systems balance at the parent sd level
     * at the parent sched domain level for B.L systems, so it is possible        //regardless of avg loads, so busiest's avg load can
     * that busiest group avg load can be less than sd avg load.                //legitimately sit below the sd avg load;
     * So skip calculating load based imbalance between groups.                    //in that case skip the load-based imbalance between groups
     */
    if (!no_imbalance) {
        /*
         * We're trying to get all the cpus to the average_load,            //we're trying to bring every cpu to the average load, so we don't
         * so we don't want to push ourselves above the average load,        //want to push ourselves above it,
         * nor do we wish to reduce the max loaded cpu below the average    //nor to drag the most loaded cpu below it;
         * load. At the same time, we also don't want to reduce the            //at the same time we don't want to cut the group load
         * group load below the group capacity.                                //below the group capacity,
         * Thus we look for the minimum possible imbalance.                    //so we go for the smallest workable imbalance
         */
        max_pull = min(busiest->avg_load - sds->avg_load,                //take the smaller of (busiest->avg_load - sds->avg_load) and load_above_capacity
                        load_above_capacity);

        /* How much load to actually move to equalise the imbalance */    //then compute the gap between the sds and local avg loads, normalized:
        env->imbalance = min(max_pull * busiest->group_capacity,        //the smaller of max_pull and that gap, each scaled by its own group capacity,
                    (sds->avg_load - local->avg_load) *
                    local->group_capacity) /
                    SCHED_CAPACITY_SCALE;                                //becomes the imbalance
    } else {
        /*
         * Skipped load based imbalance calculations, but let's find        //the load-based imbalance calculation was skipped,
         * imbalance based on busiest group type or fix small imbalance.    //but still derive one from the busiest group type, or fix a small imbalance
         */
        env->imbalance = 0;
    }

    /* Boost imbalance to allow misfit task to be balanced.        //boost the imbalance so a misfit task can be balanced;
     * Always do this if we are doing a NEWLY_IDLE balance        //always do this on a newly idle balance, on the assumption that any tasks
     * on the assumption that any tasks we have must not be        //we hold are not long-running (so we cannot rely on load);
     * long-running (and hence we cannot rely upon load).
     * However if we are not idle, we should assume the tasks    //if we're not idle, assume the tasks are long-running and don't override
     * we have are longer running and not override load-based    //the load-based result above, unless we're sure the local
     * calculations above unless we are sure that the local        //group is underutilized
     * group is underutilized.
     */
    if (busiest->group_type == group_misfit_task &&                //busiest carries a misfit task, and either:
        (env->idle == CPU_NEWLY_IDLE ||                            //this is a newly idle balance, or local's task count is below its cpu count:
        local->sum_nr_running < local->group_weight)) {
        env->imbalance = max_t(long, env->imbalance,            //take the larger of the load-based imbalance and busiest->group_misfit_task_load
                       busiest->group_misfit_task_load);
    }

    /*
     * if *imbalance is less than the average load per runnable task        //if the imbalance is below the average load per runnable task,
     * there is no guarantee that any tasks will be moved so we'll have        //nothing is guaranteed to move,
     * a think about bumping its value to force at least one task to be        //so consider bumping the value up to force at least one task
     * moved                                                                //to migrate
     */
    if (env->imbalance < busiest->load_per_task) {                    //the imbalance is below busiest->load_per_task:
        /*
         * The busiest group is overloaded so it could use help                //an overloaded busiest group can take help from other groups;
         * from the other groups. If the local group has idle CPUs            //if the local group has idle cpus, isn't overloaded,
         * and it is not overloaded and has no imbalance with in            //and carries no internal imbalance,
         * the group, allow the load balance by bumping the                    //enable the balance by bumping
         * imbalance.                                                        //the imbalance
         */
        if (busiest->group_type == group_overloaded &&
            local->group_type <= group_misfit_task &&
            env->idle != CPU_NOT_IDLE) {
            env->imbalance = busiest->load_per_task;        //bump the imbalance to busiest->load_per_task (expected to move at least 1 task)
            return;
        }

        return fix_small_imbalance(env, sds);                //otherwise compute a small imbalance
    }
}
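Plugging hypothetical numbers through the main branch above (both groups overloaded, none of the skip conditions firing) makes the max_pull/imbalance formulas concrete; this is purely an illustration of the arithmetic, not output from a real system:

#include <stdio.h>

int main(void)
{
    /* hypothetical sd-wide and per-group stats, all overloaded */
    unsigned long sds_avg_load = 1200;
    unsigned long busiest_avg_load = 1500, busiest_capacity = 1600;
    unsigned long local_avg_load = 900, local_capacity = 1600;
    unsigned long busiest_nr_running = 6;

    /* load_above_capacity = (nr*1024 - capacity) * 1024 / capacity */
    unsigned long above = busiest_nr_running * 1024;              /* 6144 */
    above = (above - busiest_capacity) * 1024 / busiest_capacity; /* 2908 */

    unsigned long diff = busiest_avg_load - sds_avg_load;         /* 300 */
    unsigned long max_pull = diff < above ? diff : above;         /* 300 */

    unsigned long a = max_pull * busiest_capacity;                        /* 480000 */
    unsigned long b = (sds_avg_load - local_avg_load) * local_capacity;   /* 480000 */
    unsigned long imbalance = (a < b ? a : b) / 1024;                     /* 468 */

    printf("imbalance = %lu\n", imbalance);
    return 0;
}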

(2-2-2-1) Compute the small, secondary imbalance. When its conditions hold it simply uses busiest->load_per_task as the imbalance, acting as a second chance to trigger load balance.

/**
 * fix_small_imbalance - Calculate the minor imbalance that exists
 *            amongst the groups of a sched_domain, during
 *            load balancing.
 * @env: The load balancing environment.
 * @sds: Statistics of the sched_domain whose imbalance is to be calculated.
 */
static inline
void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
{
    unsigned long tmp, capa_now = 0, capa_move = 0;
    unsigned int imbn = 2;
    unsigned long scaled_busy_load_per_task;
    struct sg_lb_stats *local, *busiest;

    local = &sds->local_stat;
    busiest = &sds->busiest_stat;

    if (!local->sum_nr_running)                                            //if the local group has no tasks,
        local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);        //set local->load_per_task = dst_cpu's runnable load / dst_cpu's cfs task count
    else if (busiest->load_per_task > local->load_per_task)            //if busiest's load_per_task exceeds local's,
        imbn = 1;                                                    //set imbn = 1

    scaled_busy_load_per_task =                                        //normalized busiest load_per_task: busiest->load_per_task * 1024 / busiest's group_capacity;
        (busiest->load_per_task * SCHED_CAPACITY_SCALE) /            //this is roughly the (normalized) busy load a migration would move
        busiest->group_capacity;

    if (busiest->avg_load + scaled_busy_load_per_task >=            //if busiest->avg_load + scaled_busy_load_per_task >= local->avg_load + scaled_busy_load_per_task * imbn,
        local->avg_load + (scaled_busy_load_per_task * imbn)) {
        env->imbalance = busiest->load_per_task;                    //set the imbalance to busiest->load_per_task
        return;
    }

    /*
     * OK, we don't have enough imbalance to justify moving tasks,        //there isn't enough imbalance to justify moving tasks,
     * however we may be able to increase total CPU capacity used by    //but moving them might still raise the total cpu capacity
     * moving them.                                                        //actually in use
     */
                                                                    //the capa_now additions below total up busiest's and local's current usage
    capa_now += busiest->group_capacity *                            //when avg_load is the smaller term, this recovers the original group load;
            min(busiest->load_per_task, busiest->avg_load);            //when load_per_task is, it computes load_per_task * group_capacity / 1024
    capa_now += local->group_capacity *
            min(local->load_per_task, local->avg_load);
    capa_now /= SCHED_CAPACITY_SCALE;

    /* Amount of load we'd subtract */                            //the amount we'd remove (capa_move)
    if (busiest->avg_load > scaled_busy_load_per_task) {        //avg_load > scaled_busy_load means there is load available to migrate:
        capa_move += busiest->group_capacity *                    //movable load = busiest group_capacity * (the smaller of load_per_task
                min(busiest->load_per_task,                        //and avg_load - scaled_busy_load)
                busiest->avg_load - scaled_busy_load_per_task);
    }

    /* Amount of load we'd add */                                //the amount we'd add (tmp)
    if (busiest->avg_load * busiest->group_capacity <            //is the group load smaller than load_per_task * 1024? (again pick the smaller)
        busiest->load_per_task * SCHED_CAPACITY_SCALE) {
        tmp = (busiest->avg_load * busiest->group_capacity) /    //busiest group load / local->group_capacity
              local->group_capacity;
    } else {
        tmp = (busiest->load_per_task * SCHED_CAPACITY_SCALE) /    //busiest->load_per_task * 1024 / local->group_capacity
              local->group_capacity;
    }
    capa_move += local->group_capacity *                        //total usage after the move (capa_move):
            min(local->load_per_task, local->avg_load + tmp);    //local->group_capacity * (the smaller of load_per_task and avg_load + tmp) / 1024
    capa_move /= SCHED_CAPACITY_SCALE;

    /* Move if we gain throughput */
    if (capa_move > capa_now) {                            //would moving raise the overall usage? (the goal is apparently to maximize utilization)
        env->imbalance = busiest->load_per_task;
        return;
    }

    /* We can't see throughput improvement with the load-based        //the load-based method shows no throughput gain,
     * method, but it is possible depending upon group size and        //but depending on group size and capacity spread, an asymmetric-capacity
     * capacity range that there might still be an underutilized    //system may still have an underutilized cpu available;
     * cpu available in an asymmetric capacity system. Do one last    //run one last check just in case
     * check just in case.
     */
    if (env->sd->flags & SD_ASYM_CPUCAPACITY &&                //we're at DIE level,
        busiest->group_type == group_overloaded &&            //busiest is overloaded,
        busiest->sum_nr_running > busiest->group_weight &&    //busiest has more tasks than cpus,
        local->sum_nr_running < local->group_weight &&        //local has fewer tasks than cpus,
        local->group_capacity < busiest->group_capacity)    //and local's group capacity < busiest's
        env->imbalance = busiest->load_per_task;
}
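The capa_now/capa_move bookkeeping is easier to follow with numbers. A minimal sketch of the same arithmetic (all values invented; imbn = 2 because local's load_per_task is the larger here) where moving one task does improve total used capacity:

#include <stdio.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{ return a < b ? a : b; }

int main(void)
{
    /* hypothetical stats: a small busy group vs a big mostly-idle one */
    unsigned long b_cap = 512,  b_lpt = 200, b_avg = 500; /* busiest */
    unsigned long l_cap = 1024, l_lpt = 300, l_avg = 150; /* local   */
    unsigned int imbn = 2;

    unsigned long s = b_lpt * 1024 / b_cap; /* scaled_busy_load_per_task = 400 */

    if (b_avg + s >= l_avg + s * imbn) {    /* 900 >= 950? no: keep going */
        printf("imbalance = %lu\n", b_lpt);
        return 0;
    }

    unsigned long capa_now = (b_cap * min_ul(b_lpt, b_avg) +
                              l_cap * min_ul(l_lpt, l_avg)) / 1024;    /* 250 */

    unsigned long capa_move = 0, tmp;
    if (b_avg > s)
        capa_move += b_cap * min_ul(b_lpt, b_avg - s);                 /* 51200 */
    if (b_avg * b_cap < b_lpt * 1024)
        tmp = b_avg * b_cap / l_cap;
    else
        tmp = b_lpt * 1024 / l_cap;                                    /* 200 */
    capa_move = (capa_move + l_cap * min_ul(l_lpt, l_avg + tmp)) / 1024; /* 350 */

    if (capa_move > capa_now)               /* 350 > 250: throughput gain */
        printf("imbalance = %lu\n", b_lpt); /* prints 200 */
    return 0;
}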

(2-3) From the intersection of the busiest group and env->cpus, walk the cpus and find the busiest rq, i.e. the rq of the most heavily loaded cpu.

/*
 * find_busiest_queue - find the busiest runqueue among the CPUs in the group.
 */
static struct rq *find_busiest_queue(struct lb_env *env,
                     struct sched_group *group)
{
    struct rq *busiest = NULL, *rq;
    unsigned long busiest_load = 0, busiest_capacity = 1;
    int i;

    for_each_cpu_and(i, sched_group_span(group), env->cpus) {    //walk the cpus in the intersection of the busiest group and env->cpus
        unsigned long capacity, load;
        enum fbq_type rt;

        rq = cpu_rq(i);
        rt = fbq_classify_rq(rq);            //numa balancing isn't enabled here, so this is always regular (0)

        /*
         * We classify groups/runqueues into three groups:
         *  - regular: there are !numa tasks
         *  - remote:  there are numa tasks that run on the 'wrong' node
         *  - all:     there is no distinction
         *
         * In order to avoid migrating ideally placed numa tasks,
         * ignore those when there's better options.
         *
         * If we ignore the actual busiest queue to migrate another
         * task, the next balance pass can still reduce the busiest
         * queue by moving tasks around inside the node.
         *
         * If we cannot move enough load due to this classification
         * the next pass will adjust the group classification and
         * allow migration of more tasks.
         *
         * Both cases only affect the total convergence complexity.
         */
        if (rt > env->fbq_type)            //never satisfied here, never continues
            continue;

        /*
         * For ASYM_CPUCAPACITY domains with misfit tasks we simply    //when busiest is the misfit case, just look for the "biggest" misfit task
         * seek the "biggest" misfit task.
         */
        if (env->src_grp_type == group_misfit_task) {        //busiest is the misfit case:
            if (rq->misfit_task_load > busiest_load) {        //if this rq's misfit_task_load exceeds busiest_load (initially 0),
                busiest_load = rq->misfit_task_load;        //record it as busiest_load;
                busiest = rq;                                //the rq with the largest misfit load is the busiest
            }

            continue;
        }

        /*
         * Ignore cpu, which is undergoing active_balance and doesn't        //a cpu in the middle of active balance with no more than 2 tasks
         * have more than 2 tasks.                                            //is skipped outright; it cannot be the busiest rq
         */
        if (rq->active_balance && rq->nr_running <= 2)
            continue;

        capacity = capacity_of(i);            //fetch cpu_capacity, i.e. the capacity left for cfs tasks

        /*
         * For ASYM_CPUCAPACITY domains, don't pick a CPU that could        //at DIE level with big and little cores, don't pick a cpu that would
         * eventually lead to active_balancing high->low capacity.            //eventually trigger an active balance from high to low capacity:
         * Higher per-CPU capacity is considered better than balancing        //keeping the higher per-cpu capacity beats balancing
         * average load.                                                    //average load
         */
        if (env->sd->flags & SD_ASYM_CPUCAPACITY &&                //at DIE level, and
            capacity_of(env->dst_cpu) < capacity &&                //the dst cpu's capacity < this cpu's capacity, and either:
            (rq->nr_running == 1 ||                                //the rq runs one task, or it runs two and the curr task's util
                (rq->nr_running == 2 && task_util(rq->curr) <    //is below sched_small_task_threshold (102, roughly 10% of max cpu capacity)
                sched_small_task_threshold)))
            continue;

        load = cpu_runnable_load(rq);        //fetch this cpu's runnable load

        /*
         * When comparing with imbalance, use cpu_runnable_load()    //compare against the imbalance using the cpu load NOT scaled by cpu capacity
         * which is not scaled with the CPU capacity.
         */

        if (rq->nr_running == 1 && load > env->imbalance &&            //the rq runs a single task whose load already exceeds the imbalance,
            !check_cpu_capacity(rq, env->sd))                        //and its capacity is not noticeably reduced by side activity (rq cpu_capacity * sd imbalance_pct >= cpu_capacity_orig * 100): skip it
            continue;

        /*
         * For the load comparisons with the other CPU's, consider        //when comparing loads across cpus, use
         * the cpu_runnable_load() scaled with the CPU capacity, so        //cpu_runnable_load() scaled by cpu capacity,
         * that the load can be moved away from the CPU that is            //so load gets moved off the cpu that is potentially
         * potentially running at a lower capacity.                        //running at lower capacity.
         *
         * Thus we're looking for max(load_i / capacity_i), crosswise    //we're after max(load_i / capacity_i); cross-multiplying
         * multiplication to rid ourselves of the division works out    //gets rid of the division: a/b > c/d  <=>  a*d > c*b,
         * to: load_i * capacity_j > load_j * capacity_i;  where j is    //i.e. load_i * capacity_j > load_j * capacity_i, where j is
         * our previous maximum.                                        //the running maximum
         */
        if (load * busiest_capacity >= busiest_load * capacity) {    //compare as described above
            busiest_load = load;                                    //record the busiest load,
            busiest_capacity = capacity;                            //the busiest cpu_capacity,
            busiest = rq;                                            //and the busiest rq
        }
    }

    return busiest;
}
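The division-free max(load/capacity) search at the end is a generic trick worth isolating. A minimal sketch (hypothetical per-cpu loads and capacities) mirroring the comparison:

#include <stdio.h>

int main(void)
{
    /* hypothetical per-cpu runnable load and cfs capacity */
    unsigned long load[4]     = {300, 600,  500, 200};
    unsigned long capacity[4] = {400, 1024, 400, 1024};

    unsigned long busiest_load = 0, busiest_capacity = 1;
    int busiest = -1;

    for (int i = 0; i < 4; i++) {
        /* load[i]/capacity[i] >= busiest_load/busiest_capacity,
         * compared via cross-multiplication to avoid division */
        if (load[i] * busiest_capacity >= busiest_load * capacity[i]) {
            busiest_load = load[i];
            busiest_capacity = capacity[i];
            busiest = i;
        }
    }
    /* cpu2 wins: 500/400 = 1.25 is the largest ratio */
    printf("busiest cpu = %d\n", busiest);
    return 0;
}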

(2-4) Perform the detach half of migration: peel tasks off their original rq; the return value is the number of tasks detached.

/*
 * detach_tasks() -- tries to detach up to imbalance runnable load from
 * busiest_rq, as part of a balancing operation within domain "sd".
 *
 * Returns number of detached tasks if successful and 0 otherwise.
 */
static int detach_tasks(struct lb_env *env)
{
    struct list_head *tasks = &env->src_rq->cfs_tasks;    //grab the cfs task list of the busy (source) rq
    struct task_struct *p;
    unsigned long load = 0;
    int detached = 0;
    int orig_loop = env->loop;

    lockdep_assert_held(&env->src_rq->lock);

    if (env->imbalance <= 0)        //no imbalance, no task migration
        return 0;

    if (!same_cluster(env->dst_cpu, env->src_cpu))            //the migration crosses clusters:
        env->flags |= LBF_IGNORE_PREFERRED_CLUSTER_TASKS;    //set LBF_IGNORE_PREFERRED_CLUSTER_TASKS

    if (capacity_orig_of(env->dst_cpu) < capacity_orig_of(env->src_cpu))    //migrating from a cpu with larger orig capacity to a smaller one:
        env->flags |= LBF_IGNORE_BIG_TASKS;                                    //set LBF_IGNORE_BIG_TASKS

redo:
    while (!list_empty(tasks)) {    //walk the cfs tasks
        /*
         * We don't want to steal all, otherwise we may be treated likewise,    //don't steal everything, or the same could be done to us,
         * which could at worst lead to a livelock crash.                         //which at worst ends in a livelock crash
         */
        if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)    //idle or newly idle balance and the src rq is down to 1 task or fewer: stop
            break;

        p = list_last_entry(tasks, struct task_struct, se.group_node);    //take the last task p on the list

        env->loop++;        //count every task examined for detach (not necessarily detached)
        /* We've more or less seen every task there is, call it quits */
        if (env->loop > env->loop_max)        //examined more than the cap allows: stop
            break;

        /* take a breather every nr_migrate tasks */
        if (env->loop > env->loop_break) {                //examined more than sched_nr_migrate_break tasks:
            env->loop_break += sched_nr_migrate_break;    //take a breather,
            env->flags |= LBF_NEED_BREAK;                //setting LBF_NEED_BREAK
            break;
        }

        if (!can_migrate_task(p, env))        //(2-4-1)can task p be migrated to this cpu?
            goto next;

        /*
         * Depending of the number of CPUs and tasks and the    //depending on the cpu/task counts and the cgroup hierarchy,
         * cgroup hierarchy, task_h_load() can return a null    //task_h_load() can return zero;
         * value. Make sure that env->imbalance decreases        //make sure env->imbalance still decreases, otherwise
         * otherwise detach_tasks() will stop only after        //detach_tasks() would only stop once it hits loop_max
         * detaching up to loop_max tasks.
         */
        load = max_t(unsigned long, task_h_load(p), 1);        //clamp the load to at least 1


        if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)    //LB_MIN restricts migration of small tasks: when true, tasks with load below 16 sit out load balance; it defaults to false
            goto next;

        /*
         * p is not running task when we goes until here, so if p is one    //by this point p is not the running task; if p is one of only
         * of the 2 task in src cpu rq and not the running one,                //2 tasks on the src rq and not the running one,
         * that means it is the only task that can be balanced.                //it is the only task that can be balanced at all.
         * So only when there is other tasks can be balanced or                //so only when other tasks could also move (task count > 2),
         * there is situation to ignore big task, it is needed                //or big tasks are to be ignored,
         * to skip the task load bigger than 2*imbalance.                    //do we skip a task whose load exceeds 2x the imbalance.
         *
         * And load based checks are skipped for prefer_spread in        //for prefer_spread, the busiest-group search skips load-based checks,
         * finding busiest group, ignore the task's h_load.                //so the task's h_load is ignored here too
         */
        if (!env->prefer_spread &&                            //1. prefer_spread is not set, and
            ((cpu_rq(env->src_cpu)->nr_running > 2) ||        //2. the src rq runs more than 2 tasks,
            (env->flags & LBF_IGNORE_BIG_TASKS)) &&            //   or the ignore-big-tasks flag is set, and
            ((load / 2) > env->imbalance))                    //3. load > 2x the imbalance
            goto next;                                        //all of 1, 2 and 3 hold: skip task p

        detach_task(p, env);                        //(2-4-2)detach the task per the migration env
        list_add(&p->se.group_node, &env->tasks);    //park the detached task on the env->tasks list for now

        detached++;                    //count detached tasks
        env->imbalance -= load;        //recompute the remaining imbalance after the detach

#ifdef CONFIG_PREEMPTION
        /*
         * NEWIDLE balancing is a source of latency, so preemptible            //newly idle balancing is latency-sensitive; on a preemptible kernel,
         * kernels will stop after the first task is detached to minimize    //stop after the first detached task to keep
         * the critical section.                                            //the critical section short
         */
        if (env->idle == CPU_NEWLY_IDLE)
            break;
#endif

        /*
         * We only want to steal up to the prescribed amount of        //only steal up to the prescribed amount of runnable load;
         * runnable load.                                            //don't over-migrate
         */
        if (env->imbalance <= 0)            //once the imbalance is gone, stop detaching
            break;

        continue;        //imbalance not yet erased and no break condition hit: try the next task
next:
#ifdef CONFIG_SCHED_WALT
        trace_sched_load_balance_skip_tasks(env->src_cpu, env->dst_cpu,
                env->src_grp_type, p->pid, load, task_util(p),
                cpumask_bits(&p->cpus_mask)[0]);
#endif
        list_move(&p->se.group_node, tasks);            //landing here means a skip condition fired; put the examined task p back at the list head (scanning starts from the tail)
    }

    if (env->flags & (LBF_IGNORE_BIG_TASKS |              //if ignore-big-tasks or ignore-preferred-cluster-tasks was set
            LBF_IGNORE_PREFERRED_CLUSTER_TASKS) && !detached) {   //and nothing qualified for detach (detached == 0):
        tasks = &env->src_rq->cfs_tasks;            //reload the src rq's cfs task list,
        env->flags &= ~(LBF_IGNORE_BIG_TASKS |         //drop the two flags,
                LBF_IGNORE_PREFERRED_CLUSTER_TASKS);
        env->loop = orig_loop;                  //reset the loop count, discarding the earlier pass,
        goto redo;                              //and scan again
    }

    /*
     * Right now, this is one of only two places we collect this stat
     * so we can safely collect detach_one_task() stats here rather
     * than inside detach_one_task().
     */
    schedstat_add(env->sd->lb_gained[env->idle], detached);        //schedstats: record the number of detached tasks

    return detached;    //return the number of detached tasks
}

(2-4-1) Decide whether task p can be migrated to this cpu

/*
 * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
 */
static
int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
    int tsk_cache_hot;
    int can_migrate = 1;

    lockdep_assert_held(&env->src_rq->lock);

    trace_android_rvh_can_migrate_task(p, env->dst_cpu, &can_migrate);    //a GKI trace hook; treat it as if it weren't there
    if (!can_migrate)
        return 0;

    /*
     * We do not migrate tasks that are:                        //tasks are not migrated in these cases:
     * 1) throttled_lb_pair, or                                  //1. throttled_lb_pair
     * 2) cannot be migrated to this CPU due to cpus_ptr, or     //2. affinity forbids this cpu
     * 3) running (obviously), or                                //3. the task is currently running
     * 4) are cache-hot on their current CPU.                    //4. the task is cache-hot on its current cpu
     */
    if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))    //is the src or dst rq under cfs bandwidth throttling? if so, don't migrate
        return 0;

    /*
     * don't allow pull boost task to smaller cores.    //never pull a boosted task from a big core to a little one
     */
    if (!can_migrate_boosted_task(p, env->src_cpu, env->dst_cpu))    //the task is boosted, sits in an rtg, and dst's orig capacity < src's
        return 0;

#ifdef CONFIG_SCHED_WALT
    if (p->wts.iowaited && is_min_capacity_cpu(env->dst_cpu) &&        //the task is in iowait, dst is a minimum-capacity core, and src is not
            !is_min_capacity_cpu(env->src_cpu))
        return 0;
#endif

    if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {        //affinity (e.g. cpuset) keeps task p off the dst cpu
        int cpu;

        schedstat_inc(p->se.statistics.nr_failed_migrations_affine);    //schedstats: migrations failed due to affinity

        env->flags |= LBF_SOME_PINNED;        //record "some pinned" in the flags

        /*
         * Remember if this task can be migrated to any other CPU in    //remember whether this task could run on any other cpu of our sg;
         * our sched_group. We may want to revisit it if we couldn't    //we may want to revisit it if pulling other tasks off src_cpu
         * meet load balance goals by pulling other tasks on src_cpu.    //doesn't meet the balance goal.
         *
         * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have        //avoid computing new_dst_cpu for newly idle balance,
         * already computed one in current iteration.                    //or when one was already computed this iteration (seen via LBF_DST_PINNED)
         */
        if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
            return 0;

        /* Prevent to re-select dst_cpu via env's CPUs: */        //avoid re-selecting the same dst cpu from env->cpus
        for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {    //walk the intersection of dst_grpmask and env->cpus
            if (cpumask_test_cpu(cpu, p->cpus_ptr)) {        //does this cpu satisfy task p's affinity?
                env->flags |= LBF_DST_PINNED;                //if so, set LBF_DST_PINNED
                env->new_dst_cpu = cpu;                        //and make it the new_dst_cpu
                break;
            }
        }

        return 0;
    }

    /* Record that we found atleast one task that could run on dst_cpu */
    env->flags &= ~LBF_ALL_PINNED;        //reaching here means at least one task can run on the dst cpu

#ifdef CONFIG_SCHED_WALT
    if (static_branch_unlikely(&sched_energy_present)) {                //with EAS enabled:
        struct root_domain *rd = env->dst_rq->rd;

        if ((rcu_dereference(rd->pd) && !sd_overutilized(env->sd)) &&    //a perf domain exists and env->sd is not overutilized,
            env->idle == CPU_NEWLY_IDLE && !env->prefer_spread &&        //this is newly idle, prefer_spread is not set,
            !task_in_related_thread_group(p)) {                            //and task p is NOT in a related thread group:
            long util_cum_dst, util_cum_src;
            unsigned long demand;

            demand = task_util(p);                                    //task p's util
            util_cum_dst = cpu_util_cum(env->dst_cpu, 0) + demand;    //dst cumulative cpu util after a would-be migration
            util_cum_src = cpu_util_cum(env->src_cpu, 0) - demand;    //src cumulative cpu util after a would-be migration

            if (util_cum_dst > util_cum_src)        //don't migrate if it would leave dst cumulative util above src's
                return 0;
        }
    }

    if (env->flags & LBF_IGNORE_PREFERRED_CLUSTER_TASKS &&        //a cross-cluster migration where task p's preferred cluster
             !preferred_cluster(                                //doesn't match the dst cpu's cluster
                cpu_rq(env->dst_cpu)->wrq.cluster, p))
        return 0;

    /* Don't detach task if it doesn't fit on the destination */
    if (env->flags & LBF_IGNORE_BIG_TASKS &&            //migrating from a larger-orig-capacity cpu to a smaller one,
        !task_fits_max(p, env->dst_cpu))                //and the task doesn't fit the dst cpu's capacity
        return 0;

    /* Don't detach task if it is under active migration */
    if (env->src_rq->wrq.push_task == p)        //task p is already mid active migration
        return 0;
#endif

    if (task_running(env->src_rq, p)) {            //task p is currently running
        schedstat_inc(p->se.statistics.nr_failed_migrations_running);    //schedstats: migrations failed because the task was running
        return 0;
    }

#ifdef CONFIG_SCHED_WALT
    if ((env->idle == CPU_NEWLY_IDLE) &&            //a newly idle balance,
        is_min_capacity_cpu(env->dst_cpu) &&        //dst is a little core,
        !is_min_capacity_cpu(env->src_cpu) &&        //src is a big core,
        walt_get_rtg_status(p)) {                    //and task p is in an rtg with a big-cluster preference:
        bool pull_to_silver_allowed = false;
        unsigned int cpu;

        for_each_cpu(cpu, env->cpus) {            //walk env->cpus
            if (!is_min_capacity_cpu(cpu) &&    //a non-little core
                cpu_overutilized(cpu)) {        //that is overutilized
                pull_to_silver_allowed = true;    //justifies pulling from big to little
                break;
            }
        }

        if (!pull_to_silver_allowed)    //no such core: bail out
            return 0;
    }
#endif

    /*
     * Aggressive migration if:                 //migrate aggressively when any of these hold:
     * 1) IDLE or NEWLY_IDLE balance.           //1. an idle or newly idle balance
     * 2) destination numa is preferred         //2. the destination numa node is preferred (never on this UMA platform)
     * 3) task is cache cold, or                //3. the task is cache-cold (tsk_cache_hot <= 0)
     * 4) too many balance attempts have failed.//4. too many balance attempts have already failed
     */
    tsk_cache_hot = migrate_degrades_locality(p, env);        //always returns -1 on this UMA platform,
    if (tsk_cache_hot == -1)
        tsk_cache_hot = task_hot(p, env);        //so fall back to task_hot()

    if (env->idle != CPU_NOT_IDLE || tsk_cache_hot <= 0 ||            //idle or newly idle, or the task is cold,
        env->sd->nr_balance_failed > env->sd->cache_nice_tries) {    //or failed balances exceed the per-sd threshold:
        if (tsk_cache_hot == 1) {                                        //if the task is actually hot,
            schedstat_inc(env->sd->lb_hot_gained[env->idle]);            //record schedstats: lb_hot_gained[env->idle]
            schedstat_inc(p->se.statistics.nr_forced_migrations);        //and nr_forced_migrations
        }
        return 1;
    }

    schedstat_inc(p->se.statistics.nr_failed_migrations_hot);    //otherwise record the failure: nr_failed_migrations_hot
    return 0;
}

A quick look at how task-hot is decided:

  1. Not a cfs task: not hot
  2. The task runs with the idle policy: not hot
  3. The task is about to be picked on its current cpu (already set as the next or last buddy): hot. Note this is the CFS buddy concept, not the memory-management buddy system.
  4. The migration threshold sysctl_sched_migration_cost is -1: hot
  5. The threshold is 0: not hot
  6. Otherwise: runtime since exec_start below sysctl_sched_migration_cost means hot; anything longer is not
/*
 * Is this task likely cache-hot:
 */
static int task_hot(struct task_struct *p, struct lb_env *env)
{
    s64 delta;

    lockdep_assert_held(&env->src_rq->lock);

    if (p->sched_class != &fair_sched_class)
        return 0;

    if (unlikely(task_has_idle_policy(p)))
        return 0;

    /*
     * Buddy candidates are cache hot:
     */
    if (sched_feat(CACHE_HOT_BUDDY) && env->dst_rq->nr_running &&
            (&p->se == cfs_rq_of(&p->se)->next ||
             &p->se == cfs_rq_of(&p->se)->last))
        return 1;

    if (sysctl_sched_migration_cost == -1)
        return 1;
    if (sysctl_sched_migration_cost == 0)
        return 0;

    delta = rq_clock_task(env->src_rq) - p->se.exec_start;

    return delta < (s64)sysctl_sched_migration_cost;
}
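So the final rule boils down to "a task that ran very recently is still cache hot". A minimal sketch (sysctl_sched_migration_cost defaults to 500000 ns, i.e. 0.5 ms, on mainline kernels; the sample timestamps are made up):

#include <stdio.h>

typedef long long s64;

static s64 sysctl_sched_migration_cost = 500000; /* ns, mainline default */

/* hot when the task last ran within the migration-cost window */
static int is_task_hot(s64 now_ns, s64 exec_start_ns)
{
    if (sysctl_sched_migration_cost == -1)
        return 1;
    if (sysctl_sched_migration_cost == 0)
        return 0;
    return (now_ns - exec_start_ns) < sysctl_sched_migration_cost;
}

int main(void)
{
    s64 now = 10000000;                         /* 10 ms on the rq clock */
    printf("%d\n", is_task_hot(now, 9800000));  /* ran 0.2 ms ago -> 1 (hot) */
    printf("%d\n", is_task_hot(now, 9000000));  /* ran 1 ms ago   -> 0      */
    return 0;
}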

(2-4-2) Detach the task according to the migration env

/*
 * detach_task() -- detach the task for the migration specified in env
 */
static void detach_task(struct task_struct *p, struct lb_env *env)
{
    lockdep_assert_held(&env->src_rq->lock);

    deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);        //(2-4-2-1)dequeue the task
    lockdep_off();
    double_lock_balance(env->src_rq, env->dst_rq);            //lock both rqs together
    if (!(env->src_rq->clock_update_flags & RQCF_UPDATED))
        update_rq_clock(env->src_rq);        //refresh the src rq's clock
    set_task_cpu(p, env->dst_cpu);                            //(2-4-2-2)the migration work itself
    double_unlock_balance(env->src_rq, env->dst_rq);        //unlock both rqs
    lockdep_on();
}

(2-4-2-1) Dequeue the task

void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
    p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;    //on the detach path, on_rq becomes TASK_ON_RQ_MIGRATING (2)

    if (task_contributes_to_load(p))    //if task->state is uninterruptible, the flags don't say frozen, and the state lacks TASK_NOLOAD,
        rq->nr_uninterruptible++;        //bump the rq's blocked-task count by 1 (a task mid-migration still counts as uninterruptible)

#ifdef CONFIG_SCHED_WALT
    if (flags & DEQUEUE_SLEEP)        //this flag isn't set here, so the ed-task cleanup doesn't run
        clear_ed_task(p, rq);
#endif

    dequeue_task(rq, p, flags);        //take the task off the rq
}

 

(2-4-2-2) The migration work itself:

  • Update the WALT load of the src rq, dst rq and task p: subtract p's load from the src rq and add it onto the dst rq
  • Switch task p's cfs rq and parent entity
  • Set p->cpu and p->wake_cpu to the dst cpu
void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
...(some debug code elided here)
    trace_sched_migrate_task(p, new_cpu);

    if (task_cpu(p) != new_cpu) {                    //the genuinely-migrating case
        if (p->sched_class->migrate_task_rq)
            p->sched_class->migrate_task_rq(p, new_cpu);    //a cfs task calls migrate_task_rq_fair
        p->se.nr_migrations++;                        //count task p's migrations
        rseq_migrate(p);                    //set the rseq state RSEQ_EVENT_MIGRATE_BIT and the task flag TIF_NOTIFY_RESUME
        perf_event_task_migrate(p);            //set task->sched_migrated = 1

        fixup_busy_time(p, new_cpu);        //update the WALT load, removing it from the src rq and adding it to the dst rq; fairly involved, not expanded here
    }

    __set_task_cpu(p, new_cpu);                //via set_task_rq, switch the cfs_rq and parent entity, and set p->cpu and wake_cpu to the dst cpu
}

set_task_rq updates the PELT bookkeeping and switches the cfs rq / rt rq and the parent se / parent rt_se:

/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED)
    struct task_group *tg = task_group(p);
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);    //update task p's pelt load against the src rq's timestamp, then stamp p's pelt last-update time with the dst rq's timestamp
    p->se.cfs_rq = tg->cfs_rq[cpu];        //switch task p's cfs rq to the tg's cfs_rq for the dst cpu
    p->se.parent = tg->se[cpu];            //switch task p's parent se to the tg's se for the dst cpu
#endif

#ifdef CONFIG_RT_GROUP_SCHED
    p->rt.rt_rq  = tg->rt_rq[cpu];        //switch task p's rt rq to the tg's rt_rq for the dst cpu
    p->rt.parent = tg->rt_se[cpu];        //switch task p's parent rt se to the tg's rt_se for the dst cpu
#endif
}

(2-5) Attach all detached tasks to their new rq

/*
 * attach_tasks() -- attaches all tasks detached by detach_tasks() to their
 * new rq.
 */
static void attach_tasks(struct lb_env *env)
{
    struct list_head *tasks = &env->tasks;        //take the list of all detached tasks
    struct task_struct *p;
    struct rq_flags rf;

    rq_lock(env->dst_rq, &rf);
    update_rq_clock(env->dst_rq);            //refresh the dst rq's clock

    while (!list_empty(tasks)) {
        p = list_first_entry(tasks, struct task_struct, se.group_node);        //take one task from the head
        list_del_init(&p->se.group_node);        //unlink its group_node

        attach_task(env->dst_rq, p);            //(2-5-1)attach task p onto the dst cpu's rq
    }

    /*
     * The enqueue_task_fair only updates the overutilized status    //enqueue_task_fair only refreshes the overutilized status for waking tasks;
     * for the waking tasks. Since multiple tasks may get migrated  //since several tasks may arrive via the load balancer rather than wakeups,
     * from load balancer, instead of doing it there, update the    //refresh the overutilized status here,
     * overutilized status here at the end.                            //once at the end
     */
    update_overutilized_status(env->dst_rq);    //(2-5-2)refresh the dst rq's overutil status after the migration
    rq_unlock(env->dst_rq, &rf);
}

(2-5-1) Attach task p onto the dst cpu's rq

/*
 * attach_task() -- attach the task detached by detach_task() to its new rq.
 */
static void attach_task(struct rq *rq, struct task_struct *p)
{
    lockdep_assert_held(&rq->lock);

    BUG_ON(task_rq(p) != rq);
    activate_task(rq, p, ENQUEUE_NOCLOCK);    //(2-5-1-1)enqueue task p without refreshing the rq clock
    check_preempt_curr(rq, p, 0);            //ends up in the cfs class's check_preempt_wakeup to decide whether to preempt the dst rq's curr task
}

(2-5-1-1) Enqueue task p without refreshing the rq clock

void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
    if (task_contributes_to_load(p))    //if task->state is uninterruptible, the flags don't say frozen, and the state lacks TASK_NOLOAD,
        rq->nr_uninterruptible--;        //drop the rq's blocked-task count by 1 (the migration leg that counted as uninterruptible is over)

    enqueue_task(rq, p, flags);            //put the task on the rq

    p->on_rq = TASK_ON_RQ_QUEUED;        //set on_rq back to 1
}

(2-5-2) Refresh the dst rq's overutil status after the migration

static inline void update_overutilized_status(struct rq *rq)
{
#ifdef CONFIG_SCHED_WALT
    struct sched_domain *sd;

    rcu_read_lock();
    sd = rcu_dereference(rq->sd);        //take dst rq->sd, i.e. the MC-level sd
    if (sd && !sd_overutilized(sd) &&    //the sd exists, its overutil flag (sd->shared->overutilized) is clear,
        cpu_overutilized(rq->cpu))        //and the dst cpu is overutilized: capacity_orig * 1024 < cpu_util * 1078 (a margin factor of roughly 105%)
        set_sd_overutilized(sd);        //then set the sd overutil flag: sd->shared->overutilized = true
    rcu_read_unlock();
#else
    if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
        WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
        trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
    }
#endif
}
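The overutilization test is the same fixed-point margin pattern seen earlier. A minimal sketch with the 1078 margin quoted above (the util/capacity values are invented):

#include <stdio.h>

/* overutilized when util exceeds roughly 95% of the original capacity:
 * capacity_orig * 1024 < util * 1078 */
static int cpu_overutilized(unsigned long util, unsigned long capacity_orig)
{
    return capacity_orig * 1024 < util * 1078;
}

int main(void)
{
    printf("%d\n", cpu_overutilized(400, 400));  /* at 100% of capacity: 1 */
    printf("%d\n", cpu_overutilized(360, 400));  /* at 90%:              0 */
    return 0;
}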

(2-6) Decide whether active balance is needed

static int need_active_balance(struct lb_env *env)
{
    struct sched_domain *sd = env->sd;

    if (voluntary_active_balance(env))        //(2-6-1)check the conditions for voluntarily triggering active balance
        return 1;

    if ((env->idle != CPU_NOT_IDLE) &&                                //this is an idle or newly idle balance (the dst cpu is idle),
        (capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&    //src cpu_capacity < dst cpu_capacity,
        ((capacity_orig_of(env->src_cpu) <                            //src orig cpu capacity < dst orig cpu capacity,
                capacity_orig_of(env->dst_cpu))) &&
                env->src_rq->cfs.h_nr_running == 1 &&                //the src cpu runs exactly one cfs task,
                cpu_overutilized(env->src_cpu) &&                    //the src cpu is overutilized,
                !cpu_overutilized(env->dst_cpu)) {                    //and the dst cpu is not
        return 1;
    }

    if (env->src_grp_type == group_overloaded && env->src_rq->misfit_task_load)    //the src cpu's group is overloaded and the src rq carries misfit task load
        return 1;

    return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);    //or this sd's balance-failure count exceeds sd->cache_nice_tries + 2
}

 

(2-6-1) Check the conditions for voluntarily triggering active balance

static inline bool
voluntary_active_balance(struct lb_env *env)
{
    struct sched_domain *sd = env->sd;

    if (asym_active_balance(env))    //the platform has no SMT level and therefore never sets SD_ASYM_PACKING, so this is always false
        return 1;

    /*
     * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.            //the dst cpu is idle and the src cpu has just 1 cfs task;
     * It's worth migrating the task if the src_cpu's capacity is reduced    //if the src cpu's capacity is being eaten by another sched class or irqs
     * because of other sched_class or IRQs if more capacity stays            //while more capacity remains available on the dst cpu,
     * available on dst_cpu.                                                //migrating the task is worth it
     */
    if ((env->idle != CPU_NOT_IDLE) &&                //an idle or newly idle balance (the dst cpu is idle),
        (env->src_rq->cfs.h_nr_running == 1)) {        //and the src cpu runs a single cfs task:
        if ((check_cpu_capacity(env->src_rq, sd)) &&    //src cpu_capacity * sd->imbalance_pct < src cpu_capacity_orig * 100,
            (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))    //and src cpu_capacity * sd->imbalance_pct < dst cpu_capacity * 100
            return 1;
    }

    if (env->idle != CPU_NOT_IDLE &&                //an idle or newly idle balance,
            env->src_grp_type == group_misfit_task)    //and the src cpu's group type == group_misfit_task
        return 1;

    return 0;
}
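check_cpu_capacity(), used above and in find_busiest_queue(), asks whether rt/irq pressure has eaten a noticeable share of a cpu's capacity. A minimal sketch using the MC-level imbalance_pct of 117 quoted earlier (inputs are hypothetical):

#include <stdio.h>

/* true when the capacity left for cfs sits noticeably below the
 * original capacity: cpu_capacity * imbalance_pct < capacity_orig * 100 */
static int check_cpu_capacity(unsigned long cpu_capacity,
                              unsigned long capacity_orig,
                              unsigned int imbalance_pct)
{
    return cpu_capacity * imbalance_pct < capacity_orig * 100;
}

int main(void)
{
    /* rt/irq load left only 800 of 1024 available for cfs */
    printf("%d\n", check_cpu_capacity(800, 1024, 117));  /* 1: reduced */
    printf("%d\n", check_cpu_capacity(1000, 1024, 117)); /* 0: fine    */
    return 0;
}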

 

 

Still learning and growing...