Load balancing is a mechanism that distributes the task workload roughly evenly across CPUs, balancing power against performance. For example, if many tasks are packed onto cpu0 alone, power is wasted (the heavy load forces a higher CPU frequency, and the overall execution time grows) and tasks are not run promptly, causing jank as well as extra battery drain. Load balancing triggers task migration, distributing tasks across the CPUs according to a set of rules, which both saves power and improves performance.

Load balancing performs two main operations:

1. pull: a lightly loaded CPU pulls tasks over from a heavily loaded CPU. This is the primary form of load balancing, since a CPU that is already heavily loaded should not also run the balancing work itself.

2. push: a heavily loaded CPU pushes tasks to a lightly loaded CPU. This operation is also called active load balance.

 

 


Depending on the situation, load balancing falls roughly into the following kinds:

busy balance (periodic balance): the CPU has tasks running; decide whether a load balance is needed.

idle balance: the CPU has already entered idle, and the balance is done at the tick. Under NO_HZ, a nohz idle balance is attempted instead.

newly idle balance: the CPU has no runnable task and is about to enter idle. In this case the scheduler tries to pull some tasks over from other CPUs.

active load balance: after several failed load balance attempts, active balance is performed if the conditions are met. Tasks are pushed from the heavily loaded CPU to a lightly loaded CPU. (What gets pushed is the running task.)

nohz idle balance: checks whether the current CPU is overloaded and so on, and whether an idle CPU should be woken up to pull some tasks over and share the load. The wakeup uses IPI_RESCHEDULE to bring the idle CPU out of power collapse.

 

I originally planned to cover every balance trigger in one article, but there is too much material, so it has to be split. This article focuses on the load balance triggered periodically from the scheduler tick and its call flow; the rest will follow later.

The code in this article is based on CAF-kernel-4.19 and CAF-kernel-5.4. My understanding is still maturing and mistakes are inevitable, so corrections are welcome.

 

Periodic Load Balance

In scheduler_tick(), load balance is periodically checked for and triggered:

scheduler_tick()
{
...
#ifdef CONFIG_SMP
    rq->idle_balance = idle_cpu(cpu);
    trigger_load_balance(rq);       // trigger load balance
#endif   
... 
}

trigger_load_balance() raises a SOFTIRQ when a balance is due. The nohz_balancer_kick() call that follows serves nohz idle balance and will be analyzed later.

/*
 * Trigger the SCHED_SOFTIRQ if it is time to do periodic load balancing.
 */
void trigger_load_balance(struct rq *rq)
{
    /* Don't need to rebalance while attached to NULL domain or
     * cpu is isolated.
     */
    if (unlikely(on_null_domain(rq)) || cpu_isolated(cpu_of(rq)))
        return;

    if (time_after_eq(jiffies, rq->next_balance))
        raise_softirq(SCHED_SOFTIRQ);           // raise the SOFTIRQ

    nohz_balancer_kick(rq);         // part of nohz idle balance, not analyzed here
}

So where does the softirq land once it fires? First look at where this SOFTIRQ is registered: the fair scheduling class registers it during initialization, so raising it ends up invoking run_rebalance_domains().

__init void init_sched_fair_class(void)
{
#ifdef CONFIG_SMP
    open_softirq(SCHED_SOFTIRQ, run_rebalance_domains); // register the SOFTIRQ handler

#ifdef CONFIG_NO_HZ_COMMON
    nohz.next_balance = jiffies;
    nohz.next_blocked = jiffies;
    zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
#endif
#endif /* SMP */

}

So once the SOFTIRQ fires, the handler run_rebalance_domains() runs.

/*
 * run_rebalance_domains is triggered when needed from the scheduler tick.
 * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
 */
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
    struct rq *this_rq = this_rq();
    enum cpu_idle_type idle = this_rq->idle_balance ?
                        CPU_IDLE : CPU_NOT_IDLE;

    /*
     * Since core isolation doesn't update nohz.idle_cpus_mask, there
     * is a possibility this nohz kicked cpu could be isolated. Hence
     * return if the cpu is isolated.
     */
    if (cpu_isolated(this_rq->cpu))     // skip cpus that are already isolated
        return;

    /*
     * If this CPU has a pending nohz_balance_kick, then do the
     * balancing on behalf of the other idle CPUs whose ticks are
     * stopped. Do nohz_idle_balance *before* rebalance_domains to
     * give the idle CPUs a chance to load balance. Else we may
     * load balance only within the local sched_domain hierarchy
     * and abort nohz_idle_balance altogether if we pull some load.
     */
    if (nohz_idle_balance(this_rq, idle))   // balance on behalf of all cpus already in idle, since their scheduler ticks are stopped (analyzed later);
        return;                             // doing it before rebalance_domains gives the idle cpus a chance to balance first; otherwise we would only balance within the local sched_domain hierarchy, and nohz_idle_balance is aborted once we pull some load

    /* normal load balance */
    update_blocked_averages(this_rq->cpu);      // update the load of this cpu's rq and of the cfs rqs at every level inside it
    rebalance_domains(this_rq, idle);           // (1) check each sd level and balance where due
}

(1) rebalance_domains() walks the current cpu's sched domains (from the base sd up through sd->parent), and:

1. Ages each sd's max_newidle_lb_cost at roughly 1% per second, and sums the values of all sds into the rq's max_idle_balance_cost, which is floored at sysctl_sched_migration_cost (0.5ms). (I don't yet know what newidle_lb_cost is actually used for.) A small sketch of the decay arithmetic follows this list.

2. Decides for each sd level whether a load balance is due, and starts the balance accordingly. Traversal order: MC -> DIE.

3. The rq-level balance check deadline defaults to 60s in the future. If an sd's next balance check falls earlier than the rq's, the rq's check time is pulled in to the sd's (i.e. if the sd is earlier, the rq follows the sd's check time).

4. Each sd's check is due at sd->last_balance + sd->balance_interval (the higher the domain level, the longer the interval, because migrating tasks costs more).

5. Some conditions skip the load balance check for the current sd:

  --- the sd is not overutilized and prefer spread idle is not set (no overutil means the system is not seriously overloaded; prefer spread idle is a qcom feature that, when enabled, lets tasks migrate freely to idle cpus inside the current cluster, to even out the runnable task count)

  --- sd->flags lacks SD_LOAD_BALANCE (no load balance wanted)

  --- continue_balancing is 0 (another cpu in the sd is balancing more actively, so there is no need to duplicate the work)

6. Updates the rq's next balance time when needed; also, if the current cpu is idle and nohz.next_balance is later than rq->next_balance, sets nohz.next_balance = rq->next_balance.
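To make the ~1%/s decay in step 1 concrete, here is a minimal standalone sketch (not kernel code; the starting value is made up). Multiplying by 253/256 once per second removes about 1.2% each time.

#include <stdio.h>

/* Sketch of the max_newidle_lb_cost aging in rebalance_domains():
 * once per second the cost is scaled by 253/256 (~1.2% decay).
 * The initial value is hypothetical. */
int main(void)
{
    unsigned long long cost = 100000;      /* pretend cost, in ns */
    for (int sec = 1; sec <= 5; sec++) {
        cost = cost * 253 / 256;
        printf("after %ds: %llu\n", sec, cost);
    }
    return 0;
}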

/*
 * It checks each scheduling domain to see if it is due to be balanced,
 * and initiates a balancing operation if so.
 *
 * Balancing parameters are set up in init_sched_domains.
 */
static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
{
    int continue_balancing = 1;
    int cpu = rq->cpu;
    unsigned long interval;
    struct sched_domain *sd;
    /* Earliest time when we have to do rebalance again */
    unsigned long next_balance = jiffies + 60*HZ;        // deadline for the next rebalance: 60s from now
    int update_next_balance = 0;
    int need_serialize, need_decay = 0;
    u64 max_cost = 0;

    rcu_read_lock();
    for_each_domain(cpu, sd)  {        // walk this cpu's domains, from the base sd upward through the parents
        /*
         * Decay the newidle max times here because this is a regular
         * visit to all the domains. Decay ~1% per second.
         */
        if (time_after(jiffies, sd->next_decay_max_lb_cost)) {    // decay max_newidle_lb_cost by ~1% per second: max_newidle_lb_cost*253/256
            sd->max_newidle_lb_cost =
                (sd->max_newidle_lb_cost * 253) / 256;
            sd->next_decay_max_lb_cost = jiffies + HZ;        // HZ == 1 second
            need_decay = 1;
        }
        max_cost += sd->max_newidle_lb_cost;        // sum up the cost of all domains

        if (!sd_overutilized(sd) && !prefer_spread_on_idle(cpu))    // sd not overutilized and no prefer-spread-on-idle: skip, no load balance needed
            continue;

        if (!(sd->flags & SD_LOAD_BALANCE))        // no SD_LOAD_BALANCE flag means no load balance wanted; on this platform both MC and DIE levels carry it
            continue;

        /*
         * Stop the load balance at this level. There is another
         * CPU in our sched group which is doing load balancing more
         * actively.
         */
        if (!continue_balancing) {        // continue_balancing == 0: stop balancing at this level, another cpu in our sched group is balancing more actively
            if (need_decay)
                continue;
            break;
        }

        interval = get_sd_balance_interval(sd, idle != CPU_IDLE);        // this domain's balance interval; the higher the level, the longer, since migrating tasks costs more
        need_serialize = sd->flags & SD_SERIALIZE;        // serialize the balance? Only NUMA carries this flag, so it is 0 here
        if (need_serialize) {
            if (!spin_trylock(&balancing))       
                goto out;
        }

        if (time_after_eq(jiffies, sd->last_balance + interval)) {            // the domain's interval has elapsed: do the actual balance
            if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {    // (2) the core load balance function
                /*
                 * The LBF_DST_PINNED logic could have changed
                 * env->dst_cpu, so we can't know our idle
                 * state even if we migrated tasks. Update it.
                 */
                idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;    // re-read the cpu idle state
            }
            sd->last_balance = jiffies;        // record the balance timestamp
            interval = get_sd_balance_interval(sd, idle != CPU_IDLE);            // re-read the interval (it may have been tuned inside load_balance; if any cpu in the sd has a misfit task it is forced to 1 jiffy == 1 tick)
        }
        if (need_serialize)                            // if serialization was taken above,
            spin_unlock(&balancing);                // drop the lock here
out:
        if (time_after(next_balance, sd->last_balance + interval)) {                // if this sd's next balance time (sd->last_balance + interval) falls before the rq's (next_balance, default 60s out),
            next_balance = sd->last_balance + interval;                                // pull the rq's next balance time in to the sd's;
            update_next_balance = 1;                                                // the rq ends up aligned to the earliest balance time among its sds
        }
        }
    }
    if (need_decay) {
        /*
         * Ensure the rq-wide value also decays but keep it at a
         * reasonable floor to avoid funnies with rq->avg_idle.
         */
        rq->max_idle_balance_cost =
            max((u64)sysctl_sched_migration_cost, max_cost);            // max_idle_balance_cost is floored at sysctl_sched_migration_cost (0.5ms)
    }
    rcu_read_unlock();

    /*
     * next_balance will be updated only when there is a need.
     * When the cpu is attached to null domain for ex, it will not be
     * updated.
     */
    if (likely(update_next_balance)) {
        rq->next_balance = next_balance;            // update the rq's next balance time

#ifdef CONFIG_NO_HZ_COMMON
        /*
         * If this CPU has been elected to perform the nohz idle
         * balance. Other idle CPUs have already rebalanced with
         * nohz_idle_balance() and nohz.next_balance has been
         * updated accordingly. This CPU is now running the idle load
         * balance for itself and we need to update the
         * nohz.next_balance accordingly.
         */
        if ((idle == CPU_IDLE) && time_after(nohz.next_balance, rq->next_balance))    // this cpu is idle and nohz.next_balance is later than rq->next_balance:
            nohz.next_balance = rq->next_balance;            // move nohz's next balance time up
#endif
    }
}

(2) The load_balance() function

Main framework: (figure not reproduced)

Execution flow diagram: (figure not reproduced)

 

 

Code flow:

1. Initialize the load balance env structure: the current cpu becomes the light (dst) cpu that takes over some work from a heavily loaded cpu, and the sched group scope of the balance is fixed. env->cpus is the intersection of (1) the cpus in the sched domain and (2) the active cpus.

redo: (on a retry, the src cpu, i.e. busiest, is re-selected from here)

2. Check whether the current cpu should perform the balance at all — should_we_balance(); if not, clear *continue_balancing and goto out_balanced.

3. Find the busiest group and compute the imbalance value — find_busiest_group(); if no busiest group is found, goto out_balanced.

4. Find the rq of the busiest cpu within that group — find_busiest_queue(); if none is found, goto out_balanced.

5. Initialize the migrated-task counter: ld_moved = 0.

6. Check whether the busiest rq holds more than one task:

  • If it holds >1 task:
  • Initialize the flags and cap a single load balance pass at 32 task iterations.

more_balance:

  • Re-check whether balancing is still warranted: if the busiest rq has <=1 task, or <=2 tasks while under active balance, clear LBF_ALL_PINNED and goto no_move.
  • Refresh the busiest rq's clock.
  • Detach the tasks to be migrated from the busiest rq.
  • Attach the detached tasks to the dst rq, and add them to this balance's total count of migrated tasks.
  • If env.flags carries LBF_NEED_BREAK (one detach pass examined more tasks than the threshold while imbalance remains), clear it and goto more_balance.
  • If cpuset constraints pinned the dst cpu and a new dst cpu was selected while imbalance still remains: take the new dst cpu as the dst and goto more_balance to re-run the detach/attach.
  • If affinity prevented reaching balance, set the parent sd's sg type to group_imbalanced, effectively propagating the imbalance upward.
  • If affinity prevented migrating any task at all to the dst cpu:
  • Drop the dst cpu from the candidate set; if the remaining cpus are not all inside the dst cpu's group, one of them may still be busiest, so goto redo to re-select busiest.
  • If the remaining cpus can no longer be busiest, goto out_all_pinned.

no_move:

  • If the total number of migrated tasks (ld_moved) is still 0:
  • Bump the balance failure counter (periodic balance only).
  • Decide whether active load balance is needed (a more aggressive balance that pushes tasks off the busiest cpu).
  • If the src cpu or the busiest cpu is marked reserved, balancing cannot proceed: goto out.
  • If the busiest cpu's curr task is barred from the current cpu by cpuset; or curr has strict max boost enabled, is in an rtg, and the dst cpu's orig_capacity < the src cpu's orig_capacity: set env.flags |= LBF_ALL_PINNED and goto out_one_pinned.
  • If busiest is not already doing active balance and its cpu is not isolated: mark busiest as doing active balance (busiest->active_balance = 1), set the push target to the current cpu, note that the active balance worker must be woken (active_balance = 1), and mark the current cpu reserved.
  • Check whether the active balance worker should be woken (is active_balance == 1?).
  • If so, queue busiest's active_load_balance work (the handler active_load_balance_cpu_stop runs in the stop class) and set *continue_balancing = 0. Active load balance mainly migrates the running task; it will be covered in a later post, not detailed here.
  • Reaching this point means active balance was kicked, so set the sd's balance failure count to cache_nice_tries + 10 - 1.
  • If some tasks did migrate (ld_moved != 0), reset the failure counter to 0.

7. Depending on whether active load balance was actually kicked (see the interval back-off sketch after this list):

  • If it was not, we are still in an unbalanced state, so shrink the balance interval back to the minimum: sd->min_interval.
  • If it was, and the balance interval < sd->max_interval, double the interval (*2).

8. goto out.

out_balanced:

9. If there is a parent sd (i.e. we are at MC level) and it is not the case that every task was pinned (that is, despite some affinity conflicts, balance was reached), clear the parent sd's imbalance flag: group_imbalance = 0.

out_all_pinned:

10. Every task at this level is pinned by affinity, so none can be migrated; keep the parent sd's imbalance flag so the parent level can try. The balance count is also accounted here (broken down by the current cpu's idle state), and the failure counter is cleared.

out_one_pinned:

11. Reset the total migrated-task count: ld_moved = 0.

12. In the newly idle case, goto out directly.

13. If all tasks were pinned by affinity and the balance interval < MAX_PINNED_INTERVAL (512), or the balance interval < sd->max_interval, double the interval (*2).

out:

14. Return ld_moved, the total number of tasks migrated by this balance; the abnormal paths above return 0.
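A minimal standalone sketch of the interval handling described in steps 7 and 13 (the struct and its values are illustrative; in the kernel these fields live in struct sched_domain):

#include <stdio.h>

/* Sketch of the balance_interval back-off in load_balance();
 * the interval values are made up for illustration. */
struct sd_sketch {
    unsigned long balance_interval;     /* current interval, in jiffies */
    unsigned long min_interval;
    unsigned long max_interval;
};

static void tune_interval(struct sd_sketch *sd, int kicked_active_balance)
{
    if (!kicked_active_balance)
        sd->balance_interval = sd->min_interval;    /* still unbalanced: retry soon */
    else if (sd->balance_interval < sd->max_interval)
        sd->balance_interval *= 2;                  /* back off while active balance runs */
}

int main(void)
{
    struct sd_sketch sd = { .balance_interval = 8, .min_interval = 8, .max_interval = 32 };
    tune_interval(&sd, 1);      /* active balance kicked: 8 -> 16 */
    tune_interval(&sd, 0);      /* no active balance: back to min (8) */
    printf("interval = %lu\n", sd.balance_interval);
    return 0;
}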

/*
 * Check this_cpu to ensure it is balanced within domain. Attempt to move
 * tasks if there is an imbalance.
 */
static int load_balance(int this_cpu, struct rq *this_rq,
            struct sched_domain *sd, enum cpu_idle_type idle,
            int *continue_balancing)
{
    int ld_moved = 0, cur_ld_moved, active_balance = 0;
    struct sched_domain *sd_parent = sd->parent;
    struct sched_group *group = NULL;
    struct rq *busiest = NULL;
    struct rq_flags rf;
    struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);  // the only place this per-cpu mask is used, filled right here; it becomes env->cpus

    struct lb_env env = {                    // fill in the structure load balance works with
        .sd        = sd,                        // the sched domain being balanced: the current sd
        .dst_cpu    = this_cpu,                // migration target cpu: this cpu
        .dst_rq        = this_rq,                // migration target rq: this cpu's rq
        .dst_grpmask    = sched_group_span(sd->groups),  // span of the sg belonging to the sd being balanced
        .idle        = idle,                    // idle state of the migration target (i.e. this) cpu
        .loop_break    = sched_nr_migrate_break,
        .cpus        = cpus,
        .fbq_type    = all,
        .tasks        = LIST_HEAD_INIT(env.tasks),        // init the list that will hold the tasks to migrate
        .imbalance    = 0,
        .flags        = 0,
        .loop        = 0,
    };
                                                                    // prefer_spread setup:
    env.prefer_spread = (prefer_spread_on_idle(this_cpu) &&            // this cpu has prefer-spread-on-idle set, and the following does not hold:
                !((sd->flags & SD_ASYM_CPUCAPACITY) &&                // SD_ASYM_CPUCAPACITY is set and this cpu lies outside asym_cap_sibling_cpus
                 !cpumask_test_cpu(this_cpu,                        
                         &asym_cap_sibling_cpus)));

    cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);    // env->cpus = intersection of (1) cpus in the sched domain and (2) active cpus

    schedstat_inc(sd->lb_count[idle]);        // sd->lb_count[idle]++

redo:
    if (!should_we_balance(&env)) {            // (2-1) may the target cpu act as the light cpu for this balance? (a light cpu has little work and pulls tasks from heavily loaded cpus)
        *continue_balancing = 0;
        goto out_balanced;                    // if not, clear continue_balancing = 0 and skip the migration
    }

    group = find_busiest_group(&env);        // (2-2) if imbalance exists, find the most loaded (busiest) sched_group on this level's group list, and compute the load to migrate
    if (!group) {
        schedstat_inc(sd->lb_nobusyg[idle]);        // no busiest group found: account it and bail out
        goto out_balanced;
    }

    busiest = find_busiest_queue(&env, group);    // (2-3) find the rq of the most loaded cpu inside the busiest sched_group
    if (!busiest) {
        schedstat_inc(sd->lb_nobusyq[idle]);
        goto out_balanced;
    }

    BUG_ON(busiest == env.dst_rq);            // the migration target rq must never itself be the busiest one!

    schedstat_add(sd->lb_imbalance[idle], env.imbalance);        // accumulate per-idle-state scheduling stats: the imbalance value env.imbalance

    env.src_cpu = busiest->cpu;        // record the src cpu and src rq found by the long search above in the env structure
    env.src_rq = busiest;

    ld_moved = 0;
    if (busiest->nr_running > 1) {            // migration needs the busiest rq to hold more than one task
        /*
         * Attempt to move tasks. If find_busiest_group has found
         * an imbalance but busiest->nr_running <= 1, the group is
         * still unbalanced. ld_moved simply stays zero, so it is
         * correctly treated as an imbalance.
         */
        env.flags |= LBF_ALL_PINNED;        // initial assumption: every task on the busiest rq is pinned by affinity and cannot move
        env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);    // cap one balance pass at 32 tasks, since interrupts stay off throughout

more_balance:
        rq_lock_irqsave(busiest, &rf);

        /*
         * The world might have changed. Validate assumptions.
         * And also, if the busiest cpu is undergoing active_balance,
         * it doesn't need help if it has less than 2 tasks on it.
         */

        if (busiest->nr_running <= 1 ||                                    // busiest rq holds one task or none,
            (busiest->active_balance && busiest->nr_running <= 2)) {       // or busiest is in active balance with <= 2 tasks:
            rq_unlock_irqrestore(busiest, &rf);
            env.flags &= ~LBF_ALL_PINNED;                // clear LBF_ALL_PINNED
            goto no_move;                                // and bail out without migrating
        }
        }

        update_rq_clock(busiest);                    // refresh the busiest rq's clock

        /*
         * cur_ld_moved - load moved in current iteration (task count of this pass)
         * ld_moved     - cumulative load moved across iterations (total task count so far)
         */
        cur_ld_moved = detach_tasks(&env);            // (2-4) detach: take the tasks to migrate off their old rq; returns how many

        /*
         * We've detached some tasks from busiest_rq. Every
         * task is masked "TASK_ON_RQ_MIGRATING", so we can safely
         * unlock busiest->lock, and we are able to be sure
         * that nobody can manipulate the tasks in parallel.
         * See task_rq_lock() family for the details.
         */

        rq_unlock(busiest, &rf);

        if (cur_ld_moved) {                    // tasks were detached: attach them to the dst cpu to complete the migration
            attach_tasks(&env);                // (2-5) attach all the tasks to the new rq
            ld_moved += cur_ld_moved;          // add them to the running total of migrated tasks
        }

        local_irq_restore(rf.flags);

        if (env.flags & LBF_NEED_BREAK) {        // set during detach once the loop passes a threshold; on this platform the threshold equals the loop cap, so in effect there is no break
            env.flags &= ~LBF_NEED_BREAK;
            goto more_balance;
        }

        /*
         * Revisit (affine) tasks on src_cpu that couldn't be moved to
         * us and move them to an alternate dst_cpu in our sched_group
         * where they can run. The upper limit on how many times we
         * iterate on same src_cpu is dependent on number of CPUs in our
         * sched_group.
         *
         * This changes load balance semantics a bit on who can move
         * load to a given_cpu. In addition to the given_cpu itself
         * (or a ilb_cpu acting on its behalf where given_cpu is
         * nohz-idle), we now have balance_cpu in a position to move
         * load to given_cpu. In rare situations, this may cause
         * conflicts (balance_cpu and given_cpu/ilb_cpu deciding
         * _independently_ and at _same_ time to move some load to
         * given_cpu) causing excess load to be moved to given_cpu.
         * This however should not happen so much in practice and
         * moreover subsequent load balance cycles should correct the
         * excess load moved.
         */
        if ((env.flags & LBF_DST_PINNED) && env.imbalance > 0) {    // the dst cpu was pinned (e.g. by cpuset), a new dst cpu was selected, and imbalance remains: redo detach/attach with the new dst cpu

            /* Prevent to re-select dst_cpu via env's CPUs */
            cpumask_clear_cpu(env.dst_cpu, env.cpus);            // don't re-pick the cpuset-restricted cpu as dst: drop it from env.cpus

            env.dst_rq     = cpu_rq(env.new_dst_cpu);                // promote the new dst cpu to dst, and its rq to dst rq
            env.dst_cpu     = env.new_dst_cpu;
            env.flags    &= ~LBF_DST_PINNED;                        // drop the LBF_DST_PINNED flag
            env.loop     = 0;                                    // reset the loop
            env.loop_break     = sched_nr_migrate_break;

            /*
             * Go back to "more_balance" rather than "redo" since we
             * need to continue with same src_cpu.
             */
            goto more_balance;
        }

        /*
         * We failed to reach balance because of affinity.
         */
        if (sd_parent) {                            // there is a parent sd (we are at MC level)
            int *group_imbalance = &sd_parent->groups->sgc->imbalance;

            if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)      // imbalance was left behind because of cpuset:
                *group_imbalance = 1;                                        // set the parent sd's sgc->imbalance so the parent is more likely to rebalance: *group_imbalance = 1
        }

        /* All tasks on this runqueue were pinned by CPU affinity */
        if (unlikely(env.flags & LBF_ALL_PINNED)) {                            // every task on the busiest rq is pinned by affinity and cannot move (LBF_ALL_PINNED):
            cpumask_clear_cpu(cpu_of(busiest), cpus);                          // drop busiest's cpu from the candidate set
            /*
             * Attempting to continue load balancing at the current
             * sched_domain level only makes sense if there are
             * active CPUs remaining as possible busiest CPUs to
             * pull load from which are not contained within the
             * destination group that is receiving any migrated
             * load.
             */
            if (!cpumask_subset(cpus, env.dst_grpmask)) {        // some remaining cpus lie outside the dst cpu's group:
                env.loop = 0;                                    // reset the loop,
                env.loop_break = sched_nr_migrate_break;
                goto redo;                                        // and redo the whole balance decision
            }
            goto out_all_pinned;
        }
    }

no_move:
    if (!ld_moved) {                // after all those attempts ld_moved is still 0: the balance failed
        /*
         * Increment the failure counter only on periodic balance.
         * We do not want newidle balance, which can be very
         * frequent, pollute the failure counter causing
         * excessive cache_hot migrations and active balances.
         */
        if (idle != CPU_NEWLY_IDLE) {            // not newly idle,
            if (env.src_grp_nr_running > 1)        // and the src group's nr_running > 1:
                sd->nr_balance_failed++;        // count the failure
        }

        if (need_active_balance(&env)) {                // (2-6) decide whether active balance is needed
            unsigned long flags;

            raw_spin_lock_irqsave(&busiest->lock, flags);

            /*
             * The CPUs are marked as reserved if tasks
             * are pushed/pulled from other CPUs. In that case,
             * bail out from the load balancer.
             */
            if (is_reserved(this_cpu) ||                            // this cpu is reserved,
                    is_reserved(cpu_of(busiest))) {                 // or the busiest cpu is:
                raw_spin_unlock_irqrestore(&busiest->lock,
                                flags);
                *continue_balancing = 0;                            // abort the balance and clear the flag: *continue_balancing = 0
                goto out;
            }

            /*
             * Don't kick the active_load_balance_cpu_stop,
             * if the curr task on busiest CPU can't be
             * moved to this_cpu:
             */
            if (!cpumask_test_cpu(this_cpu,                            // this_cpu is not in busiest curr's cpuset,
                        &busiest->curr->cpus_allowed)
                || !can_migrate_boosted_task(busiest->curr,            // or busiest's curr has strict max boost on, is in an rtg, and the dst cpu's orig_capacity < the src cpu's orig_capacity,
                        cpu_of(busiest), this_cpu)) {                  // so busiest's curr task cannot migrate:
                raw_spin_unlock_irqrestore(&busiest->lock,
                                flags);
                env.flags |= LBF_ALL_PINNED;                        // set env.flags |= LBF_ALL_PINNED
                goto out_one_pinned;
            }

            /*
             * ->active_balance synchronizes accesses to
             * ->active_balance_work.  Once set, it's cleared
             * only after active load balance is finished.
             */
            if (!busiest->active_balance &&                // busiest is not already in active balance,
                !cpu_isolated(cpu_of(busiest))) {    // and busiest's cpu is not isolated:
                busiest->active_balance = 1;            // mark busiest's rq as doing active balance,
                busiest->push_cpu = this_cpu;            // set the push target to this cpu,
                active_balance = 1;                        // note that the active_balance work is to be kicked,
                mark_reserved(this_cpu);                // and mark this cpu reserved
            }
            raw_spin_unlock_irqrestore(&busiest->lock, flags);

            if (active_balance) {
                stop_one_cpu_nowait(cpu_of(busiest),                        // queue the active_load_balance work (the handler runs in the stop class);
                    active_load_balance_cpu_stop, busiest,                    // the work function is active_load_balance_cpu_stop
                    &busiest->active_balance_work);
                *continue_balancing = 0;
            }

            /* We've kicked active balancing, force task migration. */
            sd->nr_balance_failed = sd->cache_nice_tries +                // active balance was kicked: set the failure count to cache_nice_tries + 10 - 1
                    NEED_ACTIVE_BALANCE_THRESHOLD - 1;
        }
    } else
        sd->nr_balance_failed = 0;        // some tasks migrated: clear the failure count

    if (likely(!active_balance)) {            // active balance was not kicked, or has already completed
        /* We were unbalanced, so reset the balancing interval */
        sd->balance_interval = sd->min_interval;        // reset the balance interval to min_interval
    } else {
        /*
         * If we've begun active balancing, start to back off. This
         * case may not be covered by the all_pinned logic if there
         * is only 1 task on the busy runqueue (because we don't call
         * detach_tasks).
         */
        if (sd->balance_interval < sd->max_interval)    // balance interval < max_interval:
            sd->balance_interval *= 2;                    // double the interval
    }

    goto out;

out_balanced:
    /*
     * We reach balance although we may have faced some affinity
     * constraints. Clear the imbalance flag only if other tasks got
     * a chance to move and fix the imbalance.
     */
    if (sd_parent && !(env.flags & LBF_ALL_PINNED)) {                    // the sd is below the root domain, and LBF_ALL_PINNED is not set (cpuset pinned only some tasks, or none)
        int *group_imbalance = &sd_parent->groups->sgc->imbalance;

        if (*group_imbalance)            // if the group imbalance flag is set,
            *group_imbalance = 0;        // clear it
    }

out_all_pinned:
    /*
     * We reach balance because all tasks are pinned at this level so
     * we can't migrate them. Let the imbalance flag set so parent level
     * can try to migrate them.
     */
    schedstat_inc(sd->lb_balanced[idle]);        // account the balance

    sd->nr_balance_failed = 0;        // clear the load balance failure count

out_one_pinned:
    /* tune up the balancing interval */                            // widen the balance interval (my guess: back off a while before retrying the balance)
    if (((env.flags & LBF_ALL_PINNED) &&                        // 1. LBF_ALL_PINNED is set and the interval < MAX_PINNED_INTERVAL (512), or
            sd->balance_interval < MAX_PINNED_INTERVAL) ||        
            (sd->balance_interval < sd->max_interval))            // 2. the interval < max_interval:
        sd->balance_interval *= 2;                                // either condition doubles the interval

    ld_moved = 0;            // reset the migrated-task count
out:
    trace_sched_load_balance(this_cpu, idle, *continue_balancing,
                 group ? group->cpumask[0] : 0,
                 busiest ? busiest->nr_running : 0,
                 env.imbalance, env.flags, ld_moved,
                 sd->balance_interval, active_balance,
                 sd_overutilized(sd), env.prefer_spread);
    return ld_moved;
}

(2-1) Decide whether the dst cpu (i.e. the current cpu) may act as the light cpu for this load balance (a light cpu has little work, and pulls tasks from heavier cpus).

 


Here:

group_balance_mask(sg) returns sg->sgc->cpumask (sg being the group of the sd doing the balance). Its meaning depends on the level: at MC level it is just the current cpu (e.g. cpu0 gives 0x1); at DIE level it is all cpu cores of the current cluster (e.g. cpu0-3 of cluster-0 gives 0x0F).

group_balance_cpu_not_isolated(sg) is the first non-isolated cpu in the intersection of sched_group_span(sg) and group_balance_mask(sg) (the two are in fact identical). sched_group_span(sg) returns sg->cpumask, with the same per-level meaning as above: the current cpu at MC level, all cpus of the current cluster at DIE level. This matches the sg->sgc->cpumask described above.

static int should_we_balance(struct lb_env *env)
{
    struct sched_group *sg = env->sd->groups;
    int cpu, balance_cpu = -1;

    /*
     * Ensure the balancing environment is consistent; can happen
     * when the softirq triggers 'during' hotplug.
     */
    if (!cpumask_test_cpu(env->dst_cpu, env->cpus))        // is the target cpu in env->cpus, i.e. (1) inside the env->sd domain and (2) active?
        return 0;

    /*
     * In the newly idle case, we will allow all the CPUs
     * to do the newly idle load balance.
     */
    if (env->idle == CPU_NEWLY_IDLE)        // newly idle balance: any cpu in env->cpus is allowed to do it
        return 1;

    /* Try to find first idle CPU */
    for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {    // find the first idle cpu in the intersection of the sg's balance mask and env->cpus
        if (!idle_cpu(cpu) || cpu_isolated(cpu))
            continue;

        balance_cpu = cpu;        // record the first idle cpu as balance_cpu
        break;
    }

    if (balance_cpu == -1)        // no idle cpu found: take the first non-isolated cpu of the sg instead
        balance_cpu = group_balance_cpu_not_isolated(sg);

    /*
     * First idle CPU or the first CPU(busiest) in this sched group
     * is eligible for doing load balancing at this and above domains.
     */
    return balance_cpu == env->dst_cpu;        // the first idle cpu, or else the first non-isolated cpu, is the one entitled to balance at this and higher domains
}

(2-2) If there is imbalance, find the busiest sched_group on this level's group list of the sched domain, and compute the load that must be migrated to restore balance.

1. Initialize the sds structure.

2. Update the sched-domain statistics used for load balancing — update_sd_lb_stats().

3. With EAS enabled, when a pd (performance domain) exists and the sd is not overutilized (EAS drives task placement until the system goes overutilized, so the load may be deliberately unbalanced to save power), filter out cases that need no load balance:

  • Not newly idle: skip the balance. (Other platforms don't have this filter; it looks odd to me and probably has to be read together with qcom's EAS.)
  • The migration dst cpu is not within env->sd->groups, or no busiest group was found: skip.
  • The busiest group has more than one cpu (i.e. DIE level, inter-cluster migration) and the local cpu's orig capacity is larger than the busiest cpu's: skip (tasks must not be pulled from little cores up to big cores).
  • The local cpu and the busiest cpu have the same orig capacity, or the migration is intra-cluster: if the busiest cpu runs only one task, skip. (That is clearly a single heavy task; moving it to a sibling or an equal-capacity cpu would not improve anything, so the migration is avoided.)

4. Check the ASYM feature, i.e. SD_ASYM_PACKING in sd->flags; the flag only appears at SMT level, which this platform lacks, so it is not examined here.

5. Conditions that force a load balance (force_balance):

  • the busiest group type is group_imbalanced;
  • the dst cpu is idle, the local group has spare capacity, and busiest is overloaded (group_no_capacity == 1);
  • the busiest group type is group_misfit_task, i.e. the group holds a misfit task.

6. Conditions under which no load balance is needed (out_balanced):

  • the target cpu's local group is busier than the busiest group (compared by avg_load; avg_load = total load of all groups in the domain * 1024 / total capacity of all groups in the domain);
  • the local group's avg_load >= the domain's avg_load: don't pull (this local group is certainly not the most idle one; a sibling group is more idle);
  • the target cpu is idle, the busiest group is not overloaded, and the local group's idle cpu count <= the busiest group's idle cpu count + 1;
  • the target cpu is not idle (newly idle or not idle): use imbalance_pct to decide; skip if busiest avg_load * 100 <= env->sd->imbalance_pct * local avg_load. imbalance_pct is fixed when the sd is built (MC: 117, DIE: 125); a worked example follows this list.

7. Depending on the above, either take the force_balance path or return NULL without balancing:

  • force_balance:
    1. save the busiest group type into env->src_grp_type;
    2. check a few more conditions and compute the imbalance value (calculate_imbalance);
    3. return the busiest group if imbalance != 0, else NULL.
  • out_balanced: set env->imbalance = 0 and return NULL.
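To make the imbalance_pct gate concrete, here is a worked example with made-up avg_load values (at DIE level imbalance_pct = 125, so busiest must carry over 25% more avg_load than local before a non-idle cpu pulls from it):

#include <stdio.h>

/* Worked example of the imbalance_pct check; the load values are
 * hypothetical. */
int main(void)
{
    unsigned long imbalance_pct = 125;       /* DIE level */
    unsigned long local_avg_load = 400;      /* made-up avg_load values */
    unsigned long busiest_avg_load = 480;

    /* 480 * 100 = 48000 <= 125 * 400 = 50000: treated as balanced */
    if (100 * busiest_avg_load <= imbalance_pct * local_avg_load)
        printf("balanced enough, skip\n");
    else
        printf("imbalanced, compute imbalance and pull\n");
    return 0;
}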
/**
 * find_busiest_group - Returns the busiest group within the sched_domain
 * if there is an imbalance.
 *
 * Also calculates the amount of weighted load which should be moved
 * to restore balance.
 *
 * @env: The load balancing environment.
 *
 * Return:    - The busiest group if imbalance exists.
 */
static struct sched_group *find_busiest_group(struct lb_env *env)
{
    struct sg_lb_stats *local, *busiest;
    struct sd_lb_stats sds;

    init_sd_lb_stats(&sds);            // initialize the sds structure

    /*
     * Compute the various statistics relavent for load balancing at
     * this level.
     */
    update_sd_lb_stats(env, &sds);                // (2-2-1) update the sched-domain statistics used for load balancing

    if (static_branch_unlikely(&sched_energy_present)) {            // sched_energy_present gates EAS task placement; it is enabled on this platform
        struct root_domain *rd = env->dst_rq->rd;

        if (rcu_dereference(rd->pd) && !sd_overutilized(env->sd)) {                // the root domain's perf domain exists (not fully understood yet), and env->sd is not overutilized
            int cpu_local, cpu_busiest;
            unsigned long capacity_local, capacity_busiest;

            if (env->idle != CPU_NEWLY_IDLE)                // not newly idle: treat as no imbalance; only newly idle balances here??? This odd filter is gone in AOSP kernel-5.10
                goto out_balanced;

            if (!sds.local || !sds.busiest)                    // the target cpu is not within env->sd->groups, or no busiest group was found: no imbalance
                goto out_balanced;

            cpu_local = group_first_cpu(sds.local);            // first cpu of the local group
            cpu_busiest = group_first_cpu(sds.busiest);    // first cpu of the busiest group found

            /* TODO:don't assume same cap cpus are in same domain */
            capacity_local = capacity_orig_of(cpu_local);            // orig capacities of the two cpus
            capacity_busiest = capacity_orig_of(cpu_busiest);
            if ((sds.busiest->group_weight > 1) &&            // busiest group_weight > 1 (group_weight = number of cpus in the sched group; MC level: 1, DIE level: 4),
                capacity_local > capacity_busiest) {        // and local capacity > busiest capacity: no imbalance
                goto out_balanced;                            
            } else if (capacity_local == capacity_busiest ||            // equal capacities,
                   asym_cap_siblings(cpu_local, cpu_busiest)) {            // or the two cpus are asym-capacity siblings:
                if (cpu_rq(cpu_busiest)->nr_running < 2)            // if the busiest group's first cpu runs 0 or 1 tasks,
                    goto out_balanced;                                // there is nothing to migrate
            }
        }
    }

    local = &sds.local_stat;            // grab the stats of the busiest and local groups
    busiest = &sds.busiest_stat;

    /* ASYM feature bypasses nice load balance check */
    if (check_asym_packing(env, &sds))            // the ASYM feature bypasses some checks and can return a result directly
        return sds.busiest;

    /* There is no busy sibling group to pull tasks from */    // no busy sibling group to pull tasks from,
    if (!sds.busiest || busiest->sum_nr_running == 0)        // i.e. no busiest group found, or it has no running task
        goto out_balanced;

    /* XXX broken for overlapping NUMA groups */
    sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)    // sds avg_load = total load of all groups in the domain * 1024 / total capacity of all groups in the domain
                        / sds.total_capacity;

    /*
     * If the busiest group is imbalanced the below checks don't
     * work because they assume all things are equal, which typically
     * isn't true due to cpus_allowed constraints and the like.
     */
    if (busiest->group_type == group_imbalanced)    // a lower-level group under busiest failed to balance because of cpu affinity and set group_imbalanced: force a balance at this level
        goto force_balance;

    /*
     * When dst_cpu is idle, prevent SMP nice and/or asymmetric group
     * capacities from resulting in underutilization due to avg_load.
     */
    if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&        // the target cpu is idle, the local group has plenty of spare capacity, and the busiest group is saturated (no capacity):
        busiest->group_no_capacity)                                             // force a balance
        goto force_balance;

    /* Misfit tasks should be dealt with regardless of the avg load */
    if (busiest->group_type == group_misfit_task)                            // a misfit task exists: force a balance
        goto force_balance;

    /*
     * If the local group is busier than the selected busiest group
     * don't try and pull any tasks.
     */
    if (local->avg_load >= busiest->avg_load)            // the target cpu's local group is busier than the busiest group (by avg_load): don't try to pull
        goto out_balanced;

    /*
     * Don't pull any tasks if this group is already above the domain
     * average load.
     */
    if (local->avg_load >= sds.avg_load)                // the local group's avg_load is already above the domain's avg_load: don't pull (it is certainly not the most idle group)
        goto out_balanced;

    if (env->idle == CPU_IDLE) {
        /*
         * This CPU is idle. If the busiest group is not overloaded
         * and there is no imbalance between this and busiest group
         * wrt idle CPUs, it is balanced. The imbalance becomes
         * significant if the diff is greater than 1 otherwise we
         * might end up to just move the imbalance on another group
         */
        if ((busiest->group_type != group_overloaded) &&
                (local->idle_cpus <= (busiest->idle_cpus + 1)))
            goto out_balanced;
    } else {
        /*
         * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
         * imbalance_pct to be conservative.
         */
        if (100 * busiest->avg_load <=                            // busiest avg_load * 100 <= env->sd->imbalance_pct * local avg_load:
                env->sd->imbalance_pct * local->avg_load)        // skip the balance; otherwise do it
            goto out_balanced;
    }

force_balance:
    /* Looks like there is an imbalance. Compute it */        // reaching force_balance means there is an imbalance, so compute it
    env->src_grp_type = busiest->group_type;            // save the busiest group type
    calculate_imbalance(env, &sds);                            // (2-2-2) compute the amount of imbalance
    trace_sched_load_balance_stats(sds.busiest->cpumask[0],
                busiest->group_type, busiest->avg_load,
                busiest->load_per_task,    sds.local->cpumask[0],
                local->group_type, local->avg_load,
                local->load_per_task,
                sds.avg_load, env->imbalance);
    return env->imbalance ? sds.busiest : NULL;

out_balanced:
    env->imbalance = 0;
    return NULL;
}

(2-2-1) Update the many sched-domain (sds) statistics used for load balancing.

1. A do-while loop walks env->sd->groups once around (for an MC-level sd: every cpu core of the current cluster; for a DIE-level sd: the two clusters). For each group:

  • Check whether the dst cpu lies inside the current sg (sched_group_span(sg) was explained in (2-1); its range differs per level):
    - if so, set local_group and record the local data; the local group is never chosen as the sd's busiest group. (A migration never stays within one group: it is either between cpus of one cluster at MC level, or between clusters at DIE level.)
    - also, if this is not a newly idle balance, or the update interval has expired, refresh the sd's sgc (MC level: the cpu capacity; DIE level: the cluster's sched group capacity), i.e. only the sgc data that load balance cares about.
  • Update this sg's load-balance statistics into the sgs structure.
  • If this is the local group, i.e. the group that receives tasks, skip the busiest-group check for it: the busiest group is the one that must send tasks away because it is too busy.
  • Before the busiest check, two extra fix-ups apply:
    - (never true on this platform) group_no_capacity and group_type would be recomputed depending on whether the child sd prefers siblings, but the condition never holds here;
    - big cores may not shed load to little cores unless they lack the capacity for their current load, i.e. only when all of the following hold are group_no_capacity cleared and group_type recomputed: the sd is at DIE level, the sgs is group_no_capacity, and asym_cap_sibling_group_has_capacity() returns true (in practice that function always returns false on this platform).
  • Check whether the current sg is the busiest group; looping this way converges on the busiest sg and sgs.

2. Update the sd_lb_stats totals used for load balancing:

  • total_running
  • total_load
  • total_capacity
  • total_util

3. Propagate state (mainly overutil up to the parent sd level):

  • save the busiest group's task count into env->src_grp_nr_running;
  • decide whether to publish the overload flag to the root domain;
  • update the overutil flag;
  • if MC level has a misfit task, also mark DIE level overutilized (the misfit state is propagated, in the form of overutil flags, up to whichever parent/grandparent levels span cpus of different capacity);
  • if sd total util * ~105% > sd total capacity, set the parent level's overutil flag as well (a worked example of this check follows).
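A worked example of that last margin check with made-up totals (1078/1024 ≈ 1.053, so the condition fires once util climbs above roughly 95% of capacity):

#include <stdio.h>

/* Worked example of the parent-overutil margin check; all numbers are
 * hypothetical, and 1078 stands in for sched_capacity_margin_up. */
int main(void)
{
    unsigned long total_capacity = 4096;
    unsigned long total_util = 3950;
    unsigned long margin = 1078;

    /* 4096 * 1024 = 4194304 < 3950 * 1078 = 4258100 -> overutilized */
    if (total_capacity * 1024 < total_util * margin)
        printf("mark the parent sd overutilized\n");
    return 0;
}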
/**
 * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
 * @env: The load balancing environment.
 * @sds: variable to hold the statistics for this sched_domain.
 */
static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
    struct sched_domain *child = env->sd->child;
    struct sched_group *sg = env->sd->groups;
    struct sg_lb_stats *local = &sds->local_stat;
    struct sg_lb_stats tmp_sgs;
    bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;  // never true on this platform: only the DIE-level sd has a child, and the MC level lacks SD_PREFER_SIBLING, so this is always false
    int sg_status = 0;

#ifdef CONFIG_NO_HZ_COMMON
    if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))     // newly idle, and nohz.has_blocked is true (blocked load was left incompletely updated at the last PELT update):
        env->flags |= LBF_NOHZ_STATS;                       // set LBF_NOHZ_STATS
#endif

    do {
        struct sg_lb_stats *sgs = &tmp_sgs;
        int local_group;

        local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(sg));                // is the target cpu inside the current sched group?
        if (local_group) {
            sds->local = sg;
            sgs = local;

            if (env->idle != CPU_NEWLY_IDLE ||
                time_after_eq(jiffies, sg->sgc->next_update))            // not a newly idle case, or the sgc (sched group capacity) update interval expired: refresh the group capacity
                update_group_capacity(env->sd, env->dst_cpu);       // (2-2-1-1) update the sched group capacity (sgc) data   
        }

        update_sg_lb_stats(env, sg, sgs, &sg_status);                    // (2-2-1-2) update the sched-group statistics used for load balancing

        if (local_group)            // dst_cpu lies in this group (the local group): no need to test it for busiest
            goto next_group;

        /*
         * In case the child domain prefers tasks go to siblings
         * first, lower the sg capacity so that we'll try
         * and move all the excess tasks away. We lower the capacity
         * of a group only if the local group has the capacity to fit
         * these excess tasks. The extra check prevents the case where
         * you always pull from the heaviest group when it is already
         * under-utilized (possible with a large weight task outweighs
         * the tasks on the system).
         */
        if (prefer_sibling && sds->local &&                        // prefer_sibling is always false here, so this branch never runs
            group_has_capacity(env, local) &&                     // (group_has_capacity is the inverse of the group_no_capacity logic)
            (sgs->sum_nr_running > local->sum_nr_running + 1)) {
            sgs->group_no_capacity = 1;
            sgs->group_type = group_classify(sg, sgs);        // reclassify the group: overloaded, imbalanced, misfit_task
        }

        /*
         * Disallow moving tasks from asym cap sibling CPUs to other
         * CPUs (lower capacity) unless the asym cap sibling group has
         * no capacity to manage the current load.
         */
        if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
            sgs->group_no_capacity &&
            asym_cap_sibling_group_has_capacity(env->dst_cpu,     // (2-2-1-3) on this platform this condition always returns false, so it never holds
                        env->sd->imbalance_pct)) {
            sgs->group_no_capacity = 0;                  // update group_no_capacity
            sgs->group_type = group_classify(sg, sgs);         // update group_type
        }

        if (update_sd_pick_busiest(env, sds, sg, sgs)) {            // (2-2-1-4) is this sg the busiest group? keep looping to converge on the busiest
            sds->busiest = sg;
            sds->busiest_stat = *sgs;
        }

next_group:
        /* Now, start updating sd_lb_stats */            // accumulate the domain-wide stats used for load balancing
        sds->total_running += sgs->sum_nr_running;
        sds->total_load += sgs->group_load;
        sds->total_capacity += sgs->group_capacity;
        sds->total_util += sgs->group_util;

        trace_sched_load_balance_sg_stats(sg->cpumask[0],
                sgs->group_type, sgs->idle_cpus,
                sgs->sum_nr_running, sgs->group_load,
                sgs->group_capacity, sgs->group_util,
                sgs->group_no_capacity,    sgs->load_per_task,
                sgs->group_misfit_task_load,
                sds->busiest ? sds->busiest->cpumask[0] : 0);

        sg = sg->next;                            // move to the next group
    } while (sg != env->sd->groups);  // until every group in the domain has been visited once

#ifdef CONFIG_NO_HZ_COMMON
    if ((env->flags & LBF_NOHZ_AGAIN) &&                                                // is LBF_NOHZ_AGAIN set? i.e. LBF_NOHZ_STATS was set above but update_sg_lb_stats could not clear all the blocked load
        cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {                // and all nohz idle cpus lie within env->sd's span:

        WRITE_ONCE(nohz.next_blocked,
               jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));            // refresh the next_blocked timestamp, presumably so the leftover blocked load gets updated
    }
#endif

    if (env->sd->flags & SD_NUMA)                                        // NUMA balance; this platform is not NUMA, so not considered here
        env->fbq_type = fbq_classify_group(&sds->busiest_stat);

    env->src_grp_nr_running = sds->busiest_stat.sum_nr_running;        // save the busiest group's sum_nr_running task count

    if (!env->sd->parent) {                                        // env->sd sits at the root domain (i.e. env->sd is the DIE level)
        struct root_domain *rd = env->dst_rq->rd;

        /* update overload indicator if we are at root domain */
        WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);        // publish the overload flag to the root domain
    }

    if (sg_status & SG_OVERUTILIZED)            // per the current status, update the sd's overutil flag
        set_sd_overutilized(env->sd);        // sd->shared->overutilized = true
    else
        clear_sd_overutilized(env->sd);       // sd->shared->overutilized = false

    /*
     * If there is a misfit task in one cpu in this sched_domain
     * it is likely that the imbalance cannot be sorted out among
     * the cpu's in this sched_domain. In this case set the
     * overutilized flag at the parent sched_domain.
     */
    if (sg_status & SG_HAS_MISFIT_TASK) {                        
        struct sched_domain *sd = env->sd->parent;

        /*
         * In case of a misfit task, load balance at the parent
         * sched domain level will make sense only if the cpus
         * have a different capacity. If cpus at a domain level have
         * the same capacity, the misfit task cannot be well
         * accommodated in any of the cpus and there is no point in
         * trying a load balance at this level
         */
        while (sd) {                                // walk up the domain levels
            if (sd->flags & SD_ASYM_CPUCAPACITY) {    // find a domain spanning cpus of different capacity
                set_sd_overutilized(sd);            // and set its overutil flag: sd->shared->overutilized = true
                break;
            }
            sd = sd->parent;
        }
    }

    /*
     * If the domain util is greater that domain capacity, load balancing
     * needs to be done at the next sched domain level as well.
     */
    if (env->sd->parent &&
        sds->total_capacity * 1024 < sds->total_util *                        // if sds total_capacity * 1024 < sds total_util * 1078 (~105%),
             sched_capacity_margin_up[group_first_cpu(sds->local)])
        set_sd_overutilized(env->sd->parent);                                // the next level must balance too: set the parent sd's overutil flag

}

(2-2-1-1) Update the sched group capacity (sgc) data.

1. If the sd has no child, it is the lowest-level sd (MC level), and only the cpu capacity needs updating: the group holds a single cpu, so the group capacity is that cpu's own capacity, which is determined by the cpu's max frequency and clipped by thermal limits.

2. This platform is big.LITTLE with 8 cores (4+4), so group capacity works out roughly as follows (re SD_OVERLAP: whether sched domains overlap; on this platform they do not). A small aggregation sketch follows below.

MC level: the lowest sd, so sgc->capacity is simply each cpu's own capacity.

The detailed MC-level group capacity computation follows the flow in the sched domain/group construction post, which calls the same function, so it is not repeated here: CPU拓扑结构和调度域/组--更新对应group的capacity.

DIE level: sgc->capacity is the sum of the child groups' sgc->capacity, i.e. effectively the capacity of all cpus in the cluster.

Finally, each level's group capacity is recorded, along with the minimum and maximum capacities.
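As an illustration of the DIE-level aggregation, here is a standalone sketch with made-up per-cpu capacities for a 4-cpu little cluster:

#include <stdio.h>

/* Sketch of DIE-level group capacity aggregation; the per-cpu
 * capacities are hypothetical. */
int main(void)
{
    unsigned long cpu_cap[4] = { 380, 380, 375, 372 };
    unsigned long capacity = 0, min_cap = (unsigned long)-1, max_cap = 0;

    for (int i = 0; i < 4; i++) {           /* one MC-level sgc per cpu */
        capacity += cpu_cap[i];
        if (cpu_cap[i] < min_cap) min_cap = cpu_cap[i];
        if (cpu_cap[i] > max_cap) max_cap = cpu_cap[i];
    }
    printf("DIE sgc: capacity=%lu min=%lu max=%lu\n",
           capacity, min_cap, max_cap);
    return 0;
}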

void update_group_capacity(struct sched_domain *sd, int cpu)
{
    struct sched_domain *child = sd->child;
    struct sched_group *group, *sdg = sd->groups;
    unsigned long capacity, min_capacity, max_capacity;
    unsigned long interval;

    interval = msecs_to_jiffies(sd->balance_interval);                // sgc updates are rate-limited: 1 ~ HZ/10
    interval = clamp(interval, 1UL, max_load_balance_interval);
    sdg->sgc->next_update = jiffies + interval;

    if (!child) {                        // an MC-level sd only needs the cpu capacity updated, since its sg holds a single cpu
        update_cpu_capacity(sd, cpu);    // (2-2-1-1-1) update the cpu capacity
        return;
    }

    capacity = 0;
    min_capacity = ULONG_MAX;
    max_capacity = 0;

    if (child->flags & SD_OVERLAP) {            // the overlapping-sd case; this platform has no sd overlap
        /*
         * SD_OVERLAP domains cannot assume that child groups
         * span the current group.
         */

        for_each_cpu(cpu, sched_group_span(sdg)) {
            struct sched_group_capacity *sgc;
            struct rq *rq = cpu_rq(cpu);

            if (cpu_isolated(cpu))
                continue;

            /*
             * build_sched_domains() -> init_sched_groups_capacity()
             * gets here before we've attached the domains to the
             * runqueues.
             *
             * Use capacity_of(), which is set irrespective of domains
             * in update_cpu_capacity().
             *
             * This avoids capacity from being 0 and
             * causing divide-by-zero issues on boot.
             */
            if (unlikely(!rq->sd)) {
                capacity += capacity_of(cpu);
            } else {
                sgc = rq->sd->groups->sgc;
                capacity += sgc->capacity;
            }

            min_capacity = min(capacity, min_capacity);
            max_capacity = max(capacity, max_capacity);
        }
    } else  {
        /*
         * !SD_OVERLAP domains can assume that child groups        // no sd overlap: all the child sd's groups together span the current group
         * span the current group.
         */

        group = child->groups;
        do {                                                    // do-while over the child sd's circular sg list; on this platform we are at DIE level here, so the child sd is the MC-level groups
            struct sched_group_capacity *sgc = group->sgc;        // grab the matching sgc
            __maybe_unused cpumask_t *cpus =
                    sched_group_span(group);                    // the group sits at MC level, so the span is just the sg's cpu

            if (!cpu_isolated(cpumask_first(cpus))) {        // skip isolated cpus
                capacity += sgc->capacity;                    // accumulate each sgc's (cpu's) capacity
                min_capacity = min(sgc->min_capacity,        // track the smallest sgc->capacity
                            min_capacity);
                max_capacity = max(sgc->max_capacity,        // track the largest sgc->capacity
                            max_capacity);
            }
            group = group->next;
        } while (group != child->groups);
    }

    sdg->sgc->capacity = capacity;                // the sum of the MC-level sgc->capacity values becomes the DIE-level group's capacity
    sdg->sgc->min_capacity = min_capacity;        // and the min/max capacities are recorded
    sdg->sgc->max_capacity = max_capacity;
}

At MC level, only the cpu capacity needs updating:

static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
    unsigned long capacity = arch_scale_cpu_capacity(cpu);    // read the per-cpu variable cpu_scale
    struct sched_group *sdg = sd->groups;

    capacity *= arch_scale_max_freq_capacity(sd, cpu);        // factor in the per-cpu variable max_freq_scale
    capacity >>= SCHED_CAPACITY_SHIFT;                        // these two steps compute: cpu_scale * max_freq_scale / 1024

    capacity = min(capacity, thermal_cap(cpu));                // the result must not exceed the thermal limit on the cpu's capacity
    cpu_rq(cpu)->cpu_capacity_orig = capacity;                // store it as this cpu rq's cpu_capacity_orig

    capacity = scale_rt_capacity(cpu, capacity);        // (2-2-1-1-1-1) compute the cpu capacity left over for the cfs rq

    if (!capacity)            // if nothing is left for cfs, force it to 1
        capacity = 1;

    cpu_rq(cpu)->cpu_capacity = capacity;        // update the related sgc capacities: the rq's cpu_capacity, and the sgc's capacity and min/max
    sdg->sgc->capacity = capacity;
    sdg->sgc->min_capacity = capacity;
    sdg->sgc->max_capacity = capacity;
}
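A worked example of the update_cpu_capacity() arithmetic with made-up numbers (cpu_scale = 1024 for a big core, max_freq_scale = 800 for a cpu capped below fmax, thermal cap = 760):

#include <stdio.h>

/* Worked example of the capacity arithmetic; all values are
 * hypothetical. */
int main(void)
{
    unsigned long capacity = 1024;        /* pretend arch_scale_cpu_capacity() */
    capacity *= 800;                      /* pretend arch_scale_max_freq_capacity() */
    capacity >>= 10;                      /* / 1024 -> 800 */

    unsigned long thermal = 760;          /* pretend thermal_cap() */
    if (capacity > thermal)
        capacity = thermal;               /* clamped to 760 */

    printf("cpu_capacity_orig = %lu\n", capacity);
    return 0;
}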

(2-2-1-2)更新用于load balance的调度组(sg)相关数据

  1. 遍历在env->cpus中,且在当前sg范围的每个cpu,但忽略isolate的cpu。获取cpu对应的负载均线(cpu负载相关分析,见:)。不同 level下,对应获取cpu负载均线会有差别,具体信息(包括flags、imbalance_pct等)如下: 

   

spark 负载均衡开关 负载均衡pool_ci_05

   

spark 负载均衡开关 负载均衡pool_ci_06

  1. 首先,根据balance的类别:idle、newly idle、busy,来使用对应特定的cpu负载均线。再次,根据是否为local group,还会有不同的bias:local group【负载均线值和实时值,取其较】,非local group【负载均线值和实时值,取其较】。将最后的结果作为该cpu的balance load。 需要注意的是对应的cpu_load需要将idx减1,例如busy balance情况下,busy_idx = 2,而均线对应的是cpu_load[1],即(old_load + new_load)/ 2 的那条均线。(注意:target_load() 和 source_load() 两个函数的实现)
  • cpu load这部分,我看了kernel-5.4以后就不再使用这种bias了,而是直接使用cpu_load[0]的数据。
  3. Collect the sched-group statistics used for load balancing (the fields of the sgs struct), including:
  1. group_load: the per-cpu balance loads from above, summed over the group
  2. group_util: the group's total cpu_util, i.e. the cpu utilization tracked by WALT or PELT
  3. sum_nr_running: the total number of runnable tasks on all cfs rqs in the group
  4. sg_status: whether the group is currently in a special state: overload, overutil, misfit
  5. sum_weighted_load: the sum of the instantaneous loads of all cpus in the group
  6. idle_cpus: the number of idle cpus in the group
  7. group_misfit_task_load: at DIE level, if overload shows up, record the largest misfit_task_load among all rqs
  8. group_capacity: simply group->sgc->capacity; at MC level each cpu's own capacity, at DIE level the sum of the cpu capacities within the cluster
  9. avg_load: computed as group_load * 1024 / group_capacity
  10. group_weight: simply group->group_weight; 1 at MC level, 4 at DIE level
  11. group_no_capacity: whether the group has any spare capacity left; effectively a record of overload, since an overloaded group by definition has none. The overload condition: group_capacity * 100 < group_util * imbalance_pct (the sd's imbalance threshold; MC: 117, DIE: 125)
  12. group_type: classifies the group's state from the fields above. The main types:
  • group_other (0, normal),
  • group_misfit_task (1, a misfit task is present),
  • group_imbalanced (2, imbalanced, group->sgc->imbalance is set),
  • group_overloaded (3, overloaded, i.e. group_no_capacity is true).

        This is somewhat like sg_status, except group_type adds "group_imbalanced" and drops overutil.

  • load_per_task: computed as sum_weighted_load / sum_nr_running. A small worked example of these derived fields follows.
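As a sanity check on the derived fields, here is a minimal sketch (made-up numbers, not from a real trace) computing avg_load, load_per_task and the spare-capacity test for one group at DIE level (imbalance_pct = 125); it folds group_is_overloaded() down to its core condition and leaves out the WALT rotation special case:

#include <stdio.h>

int main(void)
{
    /* hypothetical snapshot of one 4-cpu cluster at DIE level */
    unsigned long group_load = 2600;        /* biased per-cpu loads, summed */
    unsigned long sum_weighted_load = 2400; /* instantaneous loads, summed */
    unsigned long group_util = 1900;
    unsigned long group_capacity = 1600;
    unsigned int  sum_nr_running = 6;
    unsigned int  group_weight = 4;
    unsigned int  imbalance_pct = 125;      /* DIE-level threshold */

    unsigned long avg_load = group_load * 1024 / group_capacity;
    unsigned long load_per_task = sum_weighted_load / sum_nr_running;

    /* group_is_overloaded(), minus the WALT rotation branch */
    int no_capacity = (sum_nr_running > group_weight) &&
                      (group_capacity * 100 < group_util * imbalance_pct);

    printf("avg_load=%lu load_per_task=%lu group_no_capacity=%d\n",
           avg_load, load_per_task, no_capacity); /* 1664 400 1 */
    return 0;
}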

Driven by the loop in (2-2-1), this function runs two rounds: the first updates each per-cpu sg within a cluster; the second updates the two per-cluster sgs.

/**
 * update_sg_lb_stats - Update sched_group's statistics for load balancing.
 * @env: The load balancing environment.
 * @group: sched_group whose statistics are to be updated.
 * @sgs: variable to hold the statistics for this group.
 * @sg_status: Holds flag indicating the status of the sched_group
 */
static inline void update_sg_lb_stats(struct lb_env *env,
                      struct sched_group *group,
                      struct sg_lb_stats *sgs,
                      int *sg_status)
{
    int local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));        //flag local_group: is the migration's destination cpu inside this group?
    int load_idx = get_sd_load_idx(env->sd, env->idle);        //(2-2-1-2-1)pick which set of cpu load data to use; there are three: idle, newly idle, busy
    unsigned long load;
    int i, nr_running;

    memset(sgs, 0, sizeof(*sgs));

    for_each_cpu_and(i, sched_group_span(group), env->cpus) {                //iterate every cpu that is inside the group's span and also in env->cpus (load_balance_mask)
        struct rq *rq = cpu_rq(i);

        if (cpu_isolated(i))        //skip isolated cpus
            continue;

        if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))  //leftover blocked load may not have been folded into pelt yet, so try to update it here via update_nohz_stats
            env->flags |= LBF_NOHZ_AGAIN;        //if the blocked load still isn't fully folded in, set the flag so a later pass finishes the remainder

        /* Bias balancing toward CPUs of our domain: */
        if (local_group)
            load = target_load(i, load_idx);        //cpu[i]'s load, local-group case: with a non-zero idx, take the LARGER of the load line and the instantaneous load
        else
            load = source_load(i, load_idx);        //cpu[i]'s load, non-local case: with a non-zero idx, take the SMALLER of the load line and the instantaneous load

        sgs->group_load += load;            //total (biased) load of all cpus in the group
        sgs->group_util += cpu_util(i);        //total util of the group
        sgs->sum_nr_running += rq->cfs.h_nr_running;    //total runnable tasks across the group's cpu rqs

        nr_running = rq->nr_running;        //runnable tasks at this rq level only (excludes child se's tasks); h_nr_running, by contrast, counts all child se's runnable tasks too
        if (nr_running > 1)
            *sg_status |= SG_OVERLOAD;        //mark that this cpu has more than 1 runnable task

        if (cpu_overutilized(i)) {
            *sg_status |= SG_OVERUTILIZED;    //mark this cpu as overutilized

            if (rq->misfit_task_load)
                *sg_status |= SG_HAS_MISFIT_TASK;    //mark whether the rq holds a misfit task
        }

#ifdef CONFIG_NUMA_BALANCING
        sgs->nr_numa_running += rq->nr_numa_running;
        sgs->nr_preferred_running += rq->nr_preferred_running;
#endif
        sgs->sum_weighted_load += weighted_cpuload(rq);    //total instantaneous load of the group: sum each cpu rq's cfs_rq->avg.runnable_load_avg
        /*
         * No need to call idle_cpu() if nr_running is not 0
         */
        if (!nr_running && idle_cpu(i))
            sgs->idle_cpus++;                    //count idle cpus

        if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
            sgs->group_misfit_task_load < rq->misfit_task_load) {
            sgs->group_misfit_task_load = rq->misfit_task_load;
            *sg_status |= SG_OVERLOAD;            //detect overload (DIE level) and keep the largest rq misfit_task_load as group_misfit_task_load
        }
    }

    /* Isolated CPU has no weight */
    if (!group->group_weight) {
        sgs->group_capacity = 0;        //parameter defaults for an isolated cpu
        sgs->avg_load = 0;
        sgs->group_no_capacity = 1;
        sgs->group_type = group_other;
        sgs->group_weight = group->group_weight;
    } else {
        /* Adjust by relative CPU capacity of the group */
        sgs->group_capacity = group->sgc->capacity;                    //computed values for active cpus
        sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) /
                            sgs->group_capacity;

        sgs->group_weight = group->group_weight;

        sgs->group_no_capacity = group_is_overloaded(env, sgs);    //is the group overloaded?
        sgs->group_type = group_classify(group, sgs);                //derive the group type
    }

    if (sgs->sum_nr_running)
        sgs->load_per_task = sgs->sum_weighted_load /
                        sgs->sum_nr_running;
}

(2-2-1-2-1) Pick which set of cpu load data to use for load balancing. Based on the current cpu state, the three indexes are idle balance (idle_idx), newly idle balance (newidle_idx) and busy balance (busy_idx).

From the per-level table referenced earlier:

busy_idx: 2

newidle_idx: 0

idle_idx: differs by level; MC: 0, DIE: 1

/**
 * get_sd_load_idx - Obtain the load index for a given sched domain.
 * @sd: The sched_domain whose load_idx is to be obtained.
 * @idle: The idle status of the CPU for whose sd load_idx is obtained.
 *
 * Return: The load index.
 */
static inline int get_sd_load_idx(struct sched_domain *sd,
                    enum cpu_idle_type idle)
{
    int load_idx;

    switch (idle) {
    case CPU_NOT_IDLE:
        load_idx = sd->busy_idx;
        break;

    case CPU_NEWLY_IDLE:
        load_idx = sd->newidle_idx;
        break;
    default:
        load_idx = sd->idle_idx;
        break;
    }

    return load_idx;
}
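For reference, the cpu_load[] lines these indexes select are progressively longer-horizon decaying averages. A minimal sketch of the classic update rule, modeled on the pre-5.4 cpu_load_update() (simplified: the kernel also rounds the new sample up when it is rising, which this sketch omits), shows why cpu_load[1] is exactly the (old + new) / 2 line:

#include <stdio.h>

/* per-index decaying update: cpu_load[i] = (old * (2^i - 1) + new) >> i,
 * so cpu_load[0] is the instantaneous load and cpu_load[1] is (old+new)/2,
 * the line used when busy_idx - 1 == 1 */
int main(void)
{
    unsigned long cpu_load[5] = {0};
    unsigned long samples[] = {1024, 512, 768, 256}; /* made-up tick loads */

    for (int s = 0; s < 4; s++) {
        for (int i = 0; i < 5; i++) {
            unsigned long scale = 1UL << i;
            cpu_load[i] = (cpu_load[i] * (scale - 1) + samples[s]) >> i;
        }
    }
    for (int i = 0; i < 5; i++)  /* higher i reacts more slowly */
        printf("cpu_load[%d] = %lu\n", i, cpu_load[i]);
    return 0;
}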

The target_load() and source_load() implementations:

/*
 * Return a low guess at the load of a migration-source CPU weighted
 * according to the scheduling class and "nice" value.
 *
 * We want to under-estimate the load of migration sources, to
 * balance conservatively.
 */
static unsigned long source_load(int cpu, int type)
{
    struct rq *rq = cpu_rq(cpu);
    unsigned long total = weighted_cpuload(rq);  //total:cfs_rq->avg.runnable_load_avg

    if (type == 0 || !sched_feat(LB_BIAS))
        return total;

    return min(rq->cpu_load[type-1], total);
}

/*
 * Return a high guess at the load of a migration-target CPU weighted
 * according to the scheduling class and "nice" value.
 */
static unsigned long target_load(int cpu, int type)
{
    struct rq *rq = cpu_rq(cpu);
    unsigned long total = weighted_cpuload(rq);  //total:cfs_rq->avg.runnable_load_avg

    if (type == 0 || !sched_feat(LB_BIAS))
        return total;

    return max(rq->cpu_load[type-1], total);
}

cpu_util: note that WALT and PELT end up feeding different data into it.

static inline unsigned long cpu_util(int cpu)
{
    struct cfs_rq *cfs_rq;
    unsigned int util;

#ifdef CONFIG_SCHED_WALT
    u64 walt_cpu_util =
        cpu_rq(cpu)->wrq.walt_stats.cumulative_runnable_avg_scaled;

    return min_t(unsigned long, walt_cpu_util, capacity_orig_of(cpu));
#endif

    cfs_rq = &cpu_rq(cpu)->cfs;
    util = READ_ONCE(cfs_rq->avg.util_avg);

    if (sched_feat(UTIL_EST))
        util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));

    return min_t(unsigned long, util, capacity_orig_of(cpu));
}

The group_no_capacity decision:

  • If the task count is no greater than the number of cpus in the group, it is definitely not overloaded
  • Otherwise, under WALT accounting, if this is an idle or newly idle balance and walt rotation is enabled, it counts as overloaded
  • Otherwise check: group_capacity * 100 < group_util * the current sd's imbalance_pct; if that holds, it is overloaded
  • If none of the above apply, it is not overloaded
/*
 *  group_is_overloaded returns true if the group has more tasks than it can
 *  handle.
 *  group_is_overloaded is not equals to !group_has_capacity because a group
 *  with the exact right number of tasks, has no more spare capacity but is not
 *  overloaded so both group_has_capacity and group_is_overloaded return
 *  false.
 */
static inline bool
group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
{
    if (sgs->sum_nr_running <= sgs->group_weight)
        return false;

#ifdef CONFIG_SCHED_WALT
    if (env->idle != CPU_NOT_IDLE && walt_rotation_enabled)
        return true;
#endif

    if ((sgs->group_capacity * 100) <
            (sgs->group_util * env->sd->imbalance_pct))
        return true;

    return false;
}

The group_type classification:

static inline enum
group_type group_classify(struct sched_group *group,
              struct sg_lb_stats *sgs)
{
    if (sgs->group_no_capacity)        //group_no_capacity == 1 means the group is overloaded
        return group_overloaded;

    if (sg_imbalanced(group))        //checks group->sgc->imbalance; if set, the group is imbalanced (as far as I can tell it gets set from load_balance() via sd_parent when pinned tasks block a balance, i.e. the LBF_SOME_PINNED path — I had not located the assignment at first)
        return group_imbalanced;

    if (sgs->group_misfit_task_load)    //a non-zero misfit task load means the group carries a misfit task
        return group_misfit_task;

    return group_other;        //none of the above: the group is in a normal state
}

(2-2-1-3) asym_cap_sibling_cpus is only assigned in WALT code, and on this platform's topologies (4+4, 6+2, even 4+3+1) it never receives a valid value and stays empty, so this function always returns false:

static inline bool asym_cap_sibling_group_has_capacity(int dst_cpu, int margin)
{
    int sib1, sib2;
    int nr_running;
    unsigned long total_util, total_capacity;

    if (cpumask_empty(&asym_cap_sibling_cpus) ||
            cpumask_test_cpu(dst_cpu, &asym_cap_sibling_cpus))
        return false;

    sib1 = cpumask_first(&asym_cap_sibling_cpus);
    sib2 = cpumask_last(&asym_cap_sibling_cpus);

    if (!cpu_active(sib1) || cpu_isolated(sib1) ||
        !cpu_active(sib2) || cpu_isolated(sib2))
        return false;

    nr_running = cpu_rq(sib1)->cfs.h_nr_running +
            cpu_rq(sib2)->cfs.h_nr_running;

    if (nr_running <= 2)
        return true;

    total_capacity = capacity_of(sib1) + capacity_of(sib2);
    total_util = cpu_util(sib1) + cpu_util(sib2);

    return ((total_capacity * 100) > (total_util * margin));
}

(2-2-1-4) Decide whether the current sg is the busiest group; the busiest group emerges by comparing candidates round after round in the loop.

/**
 * update_sd_pick_busiest - return 1 on busiest group
 * @env: The load balancing environment.
 * @sds: sched_domain statistics
 * @sg: sched_group candidate to be checked for being the busiest
 * @sgs: sched_group statistics
 *
 * Determine if @sg is a busier group than the previously selected
 * busiest group.
 *
 * Return: %true if @sg is a busier group than the previously selected
 * busiest group. %false otherwise.
 */
static bool update_sd_pick_busiest(struct lb_env *env,
                   struct sd_lb_stats *sds,
                   struct sched_group *sg,
                   struct sg_lb_stats *sgs)
{
    struct sg_lb_stats *busiest = &sds->busiest_stat;

    /*
     * Don't try to pull misfit tasks we can't help.                       1. don't try to pull misfit tasks we can't actually help
     * We can use max_capacity here as reduction in capacity on some       2. max_capacity tells us whether any cpu in the group can ultimately resolve the avg_load imbalance
     * CPUs in the group should either be possible to resolve
     * internally or be covered by avg_load imbalance (eventually).
     */
    if (sgs->group_type == group_misfit_task &&                    //the group carries a misfit task, and
        (!group_smaller_max_cpu_capacity(sg, sds->local) ||        //this sg's max_capacity * margin_up >= local sg's max_capacity * 1024 (margin_up is 1078, roughly a 5% bump),
             !group_has_capacity(env, &sds->local_stat)))            //or the local group has no spare capacity
        return false;
                                                        //classify by group_type
    if (sgs->group_type > busiest->group_type)            //only a larger group_type is busier
        return true;

    if (sgs->group_type < busiest->group_type)            //a smaller group_type than the current busiest can never be busiest
        return false;

    /*
     * This sg and busiest are classified as same. when prefer_spread    //the group types tie, so look closer:
     * is true, we want to maximize the chance of pulling tasks, so        //with prefer_spread set, we maximize the chance of pulling tasks,
     * prefer to pick sg with more runnable tasks and break the ties    //preferring the sg with more runnable tasks and breaking ties
     * with utilization.                                                //with group_util
     */
    if (env->prefer_spread) {
        if (sgs->sum_nr_running < busiest->sum_nr_running)
            return false;
        if (sgs->sum_nr_running > busiest->sum_nr_running)
            return true;
        return sgs->group_util > busiest->group_util;
    }

    if (sgs->avg_load <= busiest->avg_load)        //when nothing above separates them, compare avg_load; anything <= the current busiest group cannot be busiest
        return false;

    if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))    //at MC level (only DIE level carries the asym-cpucapacity flag),
        goto asym_packing;                            //jump ahead

    /*
     * Candidate sg has no more than one task per CPU and        //the candidate sg has at most one task per cpu
     * has higher per-CPU capacity. Migrating tasks to less        //and higher per-cpu capacity than the local group;
     * capable CPUs may harm throughput. Maximize throughput,        //migrating tasks to less capable cpus may hurt throughput,
     * power/energy consequences are not considered.                //so maximize throughput and ignore power/energy consequences here
     */
    if (sgs->sum_nr_running <= sgs->group_weight &&            //group_weight is simply the cpu count in the sg
        group_smaller_min_cpu_capacity(sds->local, sg))        //true when: local sg->min_capacity * margin_up < this sg->min_capacity * 1024, margin_up = 1078 (~5%)
        return false;                                        //both hold: this sg is not the busiest

    /*
     * Candidate sg doesn't face any severe imbalance issues so        //a candidate sg without severe imbalance shouldn't be disturbed,
     * don't disturb unless the groups are of similar capacity        //unless the groups are of similar capacity,
     * where balancing is more harmless.                            //where balancing does less harm than staying imbalanced
     */
    if (sgs->group_type == group_other &&                //this sg is an ordinary one (no severe imbalance), and
        !group_similar_cpu_capacity(sds->local, sg))    //the two groups' min_capacity differ by more than ~12.5%
        return false;                                    //both hold: this sg is not the busiest

    /*
     * If we have more than one misfit sg go with the biggest misfit.    //with several misfit sgs, pick the one with the worst misfit, filtering out the milder ones
     */
    if (sgs->group_type == group_misfit_task &&                                //this sg carries a misfit task:
        sgs->group_misfit_task_load < busiest->group_misfit_task_load)        //compare group_misfit_task_load; only a larger value wins busiest
        return false;

asym_packing:

    /* This is the busiest node in its class. */
    if (!(env->sd->flags & SD_ASYM_PACKING))            //this platform never sets the flag, so we always return true here
        return true;
                                                            //the flow below is not analyzed for now
    /* No ASYM_PACKING if target CPU is already busy */
    if (env->idle == CPU_NOT_IDLE)
        return true;
    /*
     * ASYM_PACKING needs to move all the work to the highest
     * prority CPUs in the group, therefore mark all groups
     * of lower priority than ourself as busy.
     */
    if (sgs->sum_nr_running &&
        sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
        if (!sds->busiest)
            return true;

        /* Prefer to move from lowest priority CPU's work */
        if (sched_asym_prefer(sds->busiest->asym_prefer_cpu,
                      sg->asym_prefer_cpu))
            return true;
    }

    return false;
}
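The group_smaller_{min,max}_cpu_capacity() helpers used above all reduce to the same fixed-point margin test. A minimal sketch (the 1078 margin value is taken from the notes above, i.e. roughly a 5% band; inputs are invented):

#include <stdio.h>

/* returns 1 when 'a' is more than ~5% smaller than 'b':
 * a * 1078 < b * 1024  <=>  a < b * (1024/1078) ~= 0.95 * b */
static int smaller_with_margin(unsigned long a, unsigned long b)
{
    return a * 1078 < b * 1024;
}

int main(void)
{
    printf("%d\n", smaller_with_margin(740, 1024));  /* little vs big: 1 */
    printf("%d\n", smaller_with_margin(1000, 1024)); /* within 5%:    0 */
    return 0;
}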

(2-2-2) Compute the imbalance value

/**
 * calculate_imbalance - Calculate the amount of imbalance present within the
 *             groups of a given sched_domain during load balance.
 * @env: load balance environment
 * @sds: statistics of the sched_domain whose imbalance is to be calculated.
 */
static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
{
    unsigned long max_pull, load_above_capacity = ~0UL;
    struct sg_lb_stats *local, *busiest;
    bool no_imbalance = false;

    local = &sds->local_stat;
    busiest = &sds->busiest_stat;

    if (busiest->group_type == group_imbalanced) {                        //the busiest group is flagged imbalanced:
        /*
         * In the group_imb case we cannot rely on group-wide averages    //we can no longer rely on group-level avg load to restore cpu-load equilibrium,
         * to ensure CPU-load equilibrium, look at wider averages. XXX    //so look at the wider average, sds->avg_load
         */
        busiest->load_per_task =
            min(busiest->load_per_task, sds->avg_load);                    //use the smaller of busiest->load_per_task and sds->avg_load as busiest's load_per_task
    }

    /*
     * Avg load of busiest sg can be less and avg load of local sg can    //the sds/busiest/local avg loads are divided by different group capacities,
     * be greater than avg load across all sgs of sd because avg load    //and sgs with a smaller group_type were skipped while picking busiest,
     * factors in sg capacity and sgs with smaller group_type are        //so busiest's avg load can fall below (and local's rise above) the sd-wide average
     * skipped when updating the busiest sg.
     */
    if (busiest->avg_load <= sds->avg_load ||                            //busiest avg load <= sds avg load,
         local->avg_load >= sds->avg_load)                                //or local avg load >= sds avg load:
        no_imbalance = true;                                            //set no_imbalance = true and decide further below

    if (busiest->group_type != group_misfit_task && no_imbalance) {        //busiest carries no misfit task and no_imbalance == true:
        env->imbalance = 0;                                                //first force the imbalance to 0
        if (busiest->group_type == group_overloaded &&                    //then, if busiest is overloaded while local's group type <= group_misfit_task (local has little or no load trouble),
                local->group_type <= group_misfit_task) {
            env->imbalance = busiest->load_per_task;                    //set the imbalance to busiest's load_per_task
            return;                                                        //and return right away
        }
        return fix_small_imbalance(env, sds);                            //(2-2-2-1)otherwise fall back to computing a small imbalance
    }

    /*
     * If there aren't any idle CPUs, avoid creating some.    //with no idle cpus anywhere, avoid creating any through migration
     */
    if (busiest->group_type == group_overloaded &&                                //when both busiest and local are overloaded:
        local->group_type   == group_overloaded) {
        load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;    //start with load_above_capacity = busiest task count * 1024
        if (load_above_capacity > busiest->group_capacity) {        //if that exceeds busiest's group capacity,
            load_above_capacity -= busiest->group_capacity;
            load_above_capacity *= scale_load_down(NICE_0_LOAD);    //compute (task count * 1024 - group capacity) * 1024 / group capacity,
            load_above_capacity /= busiest->group_capacity;            //i.e. the load portion above busiest's capacity; with a single task this comes out near 0
        } else
            load_above_capacity = ~0UL;                                //otherwise take load_above_capacity out of consideration
    }

    /*
     * In case of a misfit task, independent of avg loads we do load balance    //with a misfit task, big.LITTLE systems balance at the parent sd level
     * at the parent sched domain level for B.L systems, so it is possible        //regardless of avg loads, so busiest's avg load can
     * that busiest group avg load can be less than sd avg load.                //legitimately sit below the sd avg load;
     * So skip calculating load based imbalance between groups.                    //in that case skip the load-based imbalance between groups
     */
    if (!no_imbalance) {
        /*
         * We're trying to get all the cpus to the average_load,            //we're trying to bring every cpu to the average load, so we don't
         * so we don't want to push ourselves above the average load,        //want to push ourselves above it,
         * nor do we wish to reduce the max loaded cpu below the average    //nor to drag the most loaded cpu below it;
         * load. At the same time, we also don't want to reduce the            //at the same time we don't want to cut the group load
         * group load below the group capacity.                                //below the group capacity,
         * Thus we look for the minimum possible imbalance.                    //so we go for the smallest workable imbalance
         */
        max_pull = min(busiest->avg_load - sds->avg_load,                //take the smaller of (busiest->avg_load - sds->avg_load) and load_above_capacity
                        load_above_capacity);

        /* How much load to actually move to equalise the imbalance */    //then compute the gap between the sds and local avg loads, normalized:
        env->imbalance = min(max_pull * busiest->group_capacity,        //the smaller of max_pull and that gap, each scaled by its own group capacity,
                    (sds->avg_load - local->avg_load) *
                    local->group_capacity) /
                    SCHED_CAPACITY_SCALE;                                //becomes the imbalance
    } else {
        /*
         * Skipped load based imbalance calculations, but let's find        //the load-based imbalance calculation was skipped,
         * imbalance based on busiest group type or fix small imbalance.    //but still derive one from the busiest group type, or fix a small imbalance
         */
        env->imbalance = 0;
    }

    /* Boost imbalance to allow misfit task to be balanced.        //boost the imbalance so a misfit task can be balanced;
     * Always do this if we are doing a NEWLY_IDLE balance        //always do this on a newly idle balance, on the assumption that any tasks
     * on the assumption that any tasks we have must not be        //we hold are not long-running (so we cannot rely on load);
     * long-running (and hence we cannot rely upon load).
     * However if we are not idle, we should assume the tasks    //if we're not idle, assume the tasks are long-running and don't override
     * we have are longer running and not override load-based    //the load-based result above, unless we're sure the local
     * calculations above unless we are sure that the local        //group is underutilized
     * group is underutilized.
     */
    if (busiest->group_type == group_misfit_task &&                //busiest carries a misfit task, and either:
        (env->idle == CPU_NEWLY_IDLE ||                            //this is a newly idle balance, or local's task count is below its cpu count:
        local->sum_nr_running < local->group_weight)) {
        env->imbalance = max_t(long, env->imbalance,            //take the larger of the load-based imbalance and busiest->group_misfit_task_load
                       busiest->group_misfit_task_load);
    }

    /*
     * if *imbalance is less than the average load per runnable task        //if the imbalance is below the average load per runnable task,
     * there is no guarantee that any tasks will be moved so we'll have        //nothing is guaranteed to move,
     * a think about bumping its value to force at least one task to be        //so consider bumping the value up to force at least one task
     * moved                                                                //to migrate
     */
    if (env->imbalance < busiest->load_per_task) {                    //the imbalance is below busiest->load_per_task:
        /*
         * The busiest group is overloaded so it could use help                //an overloaded busiest group can take help from other groups;
         * from the other groups. If the local group has idle CPUs            //if the local group has idle cpus, isn't overloaded,
         * and it is not overloaded and has no imbalance with in            //and carries no internal imbalance,
         * the group, allow the load balance by bumping the                    //enable the balance by bumping
         * imbalance.                                                        //the imbalance
         */
        if (busiest->group_type == group_overloaded &&
            local->group_type <= group_misfit_task &&
            env->idle != CPU_NOT_IDLE) {
            env->imbalance = busiest->load_per_task;        //bump the imbalance to busiest->load_per_task (expected to move at least 1 task)
            return;
        }

        return fix_small_imbalance(env, sds);                //otherwise compute a small imbalance
    }
}
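Plugging hypothetical numbers through the main branch above (both groups overloaded, none of the skip conditions firing) makes the max_pull/imbalance formulas concrete; this is purely an illustration of the arithmetic, not output from a real system:

#include <stdio.h>

int main(void)
{
    /* hypothetical sd-wide and per-group stats, all overloaded */
    unsigned long sds_avg_load = 1200;
    unsigned long busiest_avg_load = 1500, busiest_capacity = 1600;
    unsigned long local_avg_load = 900, local_capacity = 1600;
    unsigned long busiest_nr_running = 6;

    /* load_above_capacity = (nr*1024 - capacity) * 1024 / capacity */
    unsigned long above = busiest_nr_running * 1024;              /* 6144 */
    above = (above - busiest_capacity) * 1024 / busiest_capacity; /* 2908 */

    unsigned long diff = busiest_avg_load - sds_avg_load;         /* 300 */
    unsigned long max_pull = diff < above ? diff : above;         /* 300 */

    unsigned long a = max_pull * busiest_capacity;                        /* 480000 */
    unsigned long b = (sds_avg_load - local_avg_load) * local_capacity;   /* 480000 */
    unsigned long imbalance = (a < b ? a : b) / 1024;                     /* 468 */

    printf("imbalance = %lu\n", imbalance);
    return 0;
}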

(2-2-2-1) Compute the small, secondary imbalance. When its conditions hold it simply uses busiest->load_per_task as the imbalance, acting as a second chance to trigger load balance.

/**
 * fix_small_imbalance - Calculate the minor imbalance that exists
 *            amongst the groups of a sched_domain, during
 *            load balancing.
 * @env: The load balancing environment.
 * @sds: Statistics of the sched_domain whose imbalance is to be calculated.
 */
static inline
void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
{
    unsigned long tmp, capa_now = 0, capa_move = 0;
    unsigned int imbn = 2;
    unsigned long scaled_busy_load_per_task;
    struct sg_lb_stats *local, *busiest;

    local = &sds->local_stat;
    busiest = &sds->busiest_stat;

    if (!local->sum_nr_running)                                            //if the local group has no tasks,
        local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);        //set local->load_per_task = dst_cpu's runnable load / dst_cpu's cfs task count
    else if (busiest->load_per_task > local->load_per_task)            //if busiest's load_per_task exceeds local's,
        imbn = 1;                                                    //set imbn = 1

    scaled_busy_load_per_task =                                        //normalized busiest load_per_task: busiest->load_per_task * 1024 / busiest's group_capacity;
        (busiest->load_per_task * SCHED_CAPACITY_SCALE) /            //this is roughly the (normalized) busy load a migration would move
        busiest->group_capacity;

    if (busiest->avg_load + scaled_busy_load_per_task >=            //if busiest->avg_load + scaled_busy_load_per_task >= local->avg_load + scaled_busy_load_per_task * imbn,
        local->avg_load + (scaled_busy_load_per_task * imbn)) {
        env->imbalance = busiest->load_per_task;                    //set the imbalance to busiest->load_per_task
        return;
    }

    /*
     * OK, we don't have enough imbalance to justify moving tasks,        //there isn't enough imbalance to justify moving tasks,
     * however we may be able to increase total CPU capacity used by    //but moving them might still raise the total cpu capacity
     * moving them.                                                        //actually in use
     */
                                                                    //the capa_now additions below total up busiest's and local's current usage
    capa_now += busiest->group_capacity *                            //when avg_load is the smaller term, this recovers the original group load;
            min(busiest->load_per_task, busiest->avg_load);            //when load_per_task is, it computes load_per_task * group_capacity / 1024
    capa_now += local->group_capacity *
            min(local->load_per_task, local->avg_load);
    capa_now /= SCHED_CAPACITY_SCALE;

    /* Amount of load we'd subtract */                            //the amount we'd remove (capa_move)
    if (busiest->avg_load > scaled_busy_load_per_task) {        //avg_load > scaled_busy_load means there is load available to migrate:
        capa_move += busiest->group_capacity *                    //movable load = busiest group_capacity * (the smaller of load_per_task
                min(busiest->load_per_task,                        //and avg_load - scaled_busy_load)
                busiest->avg_load - scaled_busy_load_per_task);
    }

    /* Amount of load we'd add */                                //the amount we'd add (tmp)
    if (busiest->avg_load * busiest->group_capacity <            //is the group load smaller than load_per_task * 1024? (again pick the smaller)
        busiest->load_per_task * SCHED_CAPACITY_SCALE) {
        tmp = (busiest->avg_load * busiest->group_capacity) /    //busiest group load / local->group_capacity
              local->group_capacity;
    } else {
        tmp = (busiest->load_per_task * SCHED_CAPACITY_SCALE) /    //busiest->load_per_task * 1024 / local->group_capacity
              local->group_capacity;
    }
    capa_move += local->group_capacity *                        //total usage after the move (capa_move):
            min(local->load_per_task, local->avg_load + tmp);    //local->group_capacity * (the smaller of load_per_task and avg_load + tmp) / 1024
    capa_move /= SCHED_CAPACITY_SCALE;

    /* Move if we gain throughput */
    if (capa_move > capa_now) {                            //would moving raise the overall usage? (the goal is apparently to maximize utilization)
        env->imbalance = busiest->load_per_task;
        return;
    }

    /* We can't see throughput improvement with the load-based        //the load-based method shows no throughput gain,
     * method, but it is possible depending upon group size and        //but depending on group size and capacity spread, an asymmetric-capacity
     * capacity range that there might still be an underutilized    //system may still have an underutilized cpu available;
     * cpu available in an asymmetric capacity system. Do one last    //run one last check just in case
     * check just in case.
     */
    if (env->sd->flags & SD_ASYM_CPUCAPACITY &&                //we're at DIE level,
        busiest->group_type == group_overloaded &&            //busiest is overloaded,
        busiest->sum_nr_running > busiest->group_weight &&    //busiest has more tasks than cpus,
        local->sum_nr_running < local->group_weight &&        //local has fewer tasks than cpus,
        local->group_capacity < busiest->group_capacity)    //and local's group capacity < busiest's
        env->imbalance = busiest->load_per_task;
}
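The capa_now/capa_move bookkeeping is easier to follow with numbers. A minimal sketch of the same arithmetic (all values invented; imbn = 2 because local's load_per_task is the larger here) where moving one task does improve total used capacity:

#include <stdio.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{ return a < b ? a : b; }

int main(void)
{
    /* hypothetical stats: a small busy group vs a big mostly-idle one */
    unsigned long b_cap = 512,  b_lpt = 200, b_avg = 500; /* busiest */
    unsigned long l_cap = 1024, l_lpt = 300, l_avg = 150; /* local   */
    unsigned int imbn = 2;

    unsigned long s = b_lpt * 1024 / b_cap; /* scaled_busy_load_per_task = 400 */

    if (b_avg + s >= l_avg + s * imbn) {    /* 900 >= 950? no: keep going */
        printf("imbalance = %lu\n", b_lpt);
        return 0;
    }

    unsigned long capa_now = (b_cap * min_ul(b_lpt, b_avg) +
                              l_cap * min_ul(l_lpt, l_avg)) / 1024;    /* 250 */

    unsigned long capa_move = 0, tmp;
    if (b_avg > s)
        capa_move += b_cap * min_ul(b_lpt, b_avg - s);                 /* 51200 */
    if (b_avg * b_cap < b_lpt * 1024)
        tmp = b_avg * b_cap / l_cap;
    else
        tmp = b_lpt * 1024 / l_cap;                                    /* 200 */
    capa_move = (capa_move + l_cap * min_ul(l_lpt, l_avg + tmp)) / 1024; /* 350 */

    if (capa_move > capa_now)               /* 350 > 250: throughput gain */
        printf("imbalance = %lu\n", b_lpt); /* prints 200 */
    return 0;
}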

(2-3) From the intersection of the busiest group and env->cpus, walk the cpus and find the busiest rq, i.e. the rq of the most heavily loaded cpu.

/*
 * find_busiest_queue - find the busiest runqueue among the CPUs in the group.
 */
static struct rq *find_busiest_queue(struct lb_env *env,
                     struct sched_group *group)
{
    struct rq *busiest = NULL, *rq;
    unsigned long busiest_load = 0, busiest_capacity = 1;
    int i;

    for_each_cpu_and(i, sched_group_span(group), env->cpus) {    //walk the cpus in the intersection of the busiest group and env->cpus
        unsigned long capacity, load;
        enum fbq_type rt;

        rq = cpu_rq(i);
        rt = fbq_classify_rq(rq);            //numa balancing isn't enabled here, so this is always regular (0)

        /*
         * We classify groups/runqueues into three groups:
         *  - regular: there are !numa tasks
         *  - remote:  there are numa tasks that run on the 'wrong' node
         *  - all:     there is no distinction
         *
         * In order to avoid migrating ideally placed numa tasks,
         * ignore those when there's better options.
         *
         * If we ignore the actual busiest queue to migrate another
         * task, the next balance pass can still reduce the busiest
         * queue by moving tasks around inside the node.
         *
         * If we cannot move enough load due to this classification
         * the next pass will adjust the group classification and
         * allow migration of more tasks.
         *
         * Both cases only affect the total convergence complexity.
         */
        if (rt > env->fbq_type)            //never satisfied here, never continues
            continue;

        /*
         * For ASYM_CPUCAPACITY domains with misfit tasks we simply    //when busiest is the misfit case, just look for the "biggest" misfit task
         * seek the "biggest" misfit task.
         */
        if (env->src_grp_type == group_misfit_task) {        //busiest is the misfit case:
            if (rq->misfit_task_load > busiest_load) {        //if this rq's misfit_task_load exceeds busiest_load (initially 0),
                busiest_load = rq->misfit_task_load;        //record it as busiest_load;
                busiest = rq;                                //the rq with the largest misfit load is the busiest
            }

            continue;
        }

        /*
         * Ignore cpu, which is undergoing active_balance and doesn't        //a cpu in the middle of active balance with no more than 2 tasks
         * have more than 2 tasks.                                            //is skipped outright; it cannot be the busiest rq
         */
        if (rq->active_balance && rq->nr_running <= 2)
            continue;

        capacity = capacity_of(i);            //fetch cpu_capacity, i.e. the capacity left for cfs tasks

        /*
         * For ASYM_CPUCAPACITY domains, don't pick a CPU that could        //at DIE level with big and little cores, don't pick a cpu that would
         * eventually lead to active_balancing high->low capacity.            //eventually trigger an active balance from high to low capacity:
         * Higher per-CPU capacity is considered better than balancing        //keeping the higher per-cpu capacity beats balancing
         * average load.                                                    //average load
         */
        if (env->sd->flags & SD_ASYM_CPUCAPACITY &&                //at DIE level, and
            capacity_of(env->dst_cpu) < capacity &&                //the dst cpu's capacity < this cpu's capacity, and either:
            (rq->nr_running == 1 ||                                //the rq runs one task, or it runs two and the curr task's util
                (rq->nr_running == 2 && task_util(rq->curr) <    //is below sched_small_task_threshold (102, roughly 10% of max cpu capacity)
                sched_small_task_threshold)))
            continue;

        load = cpu_runnable_load(rq);        //fetch this cpu's runnable load

        /*
         * When comparing with imbalance, use cpu_runnable_load()    //compare against the imbalance using the cpu load NOT scaled by cpu capacity
         * which is not scaled with the CPU capacity.
         */

        if (rq->nr_running == 1 && load > env->imbalance &&            //the rq runs a single task whose load already exceeds the imbalance,
            !check_cpu_capacity(rq, env->sd))                        //and its capacity is not noticeably reduced by side activity (rq cpu_capacity * sd imbalance_pct >= cpu_capacity_orig * 100): skip it
            continue;

        /*
         * For the load comparisons with the other CPU's, consider        //when comparing loads across cpus, use
         * the cpu_runnable_load() scaled with the CPU capacity, so        //cpu_runnable_load() scaled by cpu capacity,
         * that the load can be moved away from the CPU that is            //so load gets moved off the cpu that is potentially
         * potentially running at a lower capacity.                        //running at lower capacity.
         *
         * Thus we're looking for max(load_i / capacity_i), crosswise    //we're after max(load_i / capacity_i); cross-multiplying
         * multiplication to rid ourselves of the division works out    //gets rid of the division: a/b > c/d  <=>  a*d > c*b,
         * to: load_i * capacity_j > load_j * capacity_i;  where j is    //i.e. load_i * capacity_j > load_j * capacity_i, where j is
         * our previous maximum.                                        //the running maximum
         */
        if (load * busiest_capacity >= busiest_load * capacity) {    //compare as described above
            busiest_load = load;                                    //record the busiest load,
            busiest_capacity = capacity;                            //the busiest cpu_capacity,
            busiest = rq;                                            //and the busiest rq
        }
    }

    return busiest;
}
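The division-free max(load/capacity) search at the end is a generic trick worth isolating. A minimal sketch (hypothetical per-cpu loads and capacities) mirroring the comparison:

#include <stdio.h>

int main(void)
{
    /* hypothetical per-cpu runnable load and cfs capacity */
    unsigned long load[4]     = {300, 600,  500, 200};
    unsigned long capacity[4] = {400, 1024, 400, 1024};

    unsigned long busiest_load = 0, busiest_capacity = 1;
    int busiest = -1;

    for (int i = 0; i < 4; i++) {
        /* load[i]/capacity[i] >= busiest_load/busiest_capacity,
         * compared via cross-multiplication to avoid division */
        if (load[i] * busiest_capacity >= busiest_load * capacity[i]) {
            busiest_load = load[i];
            busiest_capacity = capacity[i];
            busiest = i;
        }
    }
    /* cpu2 wins: 500/400 = 1.25 is the largest ratio */
    printf("busiest cpu = %d\n", busiest);
    return 0;
}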

(2-4) Perform the detach half of migration: peel tasks off their original rq; the return value is the number of tasks detached.

/*
 * detach_tasks() -- tries to detach up to imbalance runnable load from
 * busiest_rq, as part of a balancing operation within domain "sd".
 *
 * Returns number of detached tasks if successful and 0 otherwise.
 */
static int detach_tasks(struct lb_env *env)
{
    struct list_head *tasks = &env->src_rq->cfs_tasks;    //grab the cfs task list of the busy (source) rq
    struct task_struct *p;
    unsigned long load = 0;
    int detached = 0;
    int orig_loop = env->loop;

    lockdep_assert_held(&env->src_rq->lock);

    if (env->imbalance <= 0)        //no imbalance, no task migration
        return 0;

    if (!same_cluster(env->dst_cpu, env->src_cpu))            //the migration crosses clusters:
        env->flags |= LBF_IGNORE_PREFERRED_CLUSTER_TASKS;    //set LBF_IGNORE_PREFERRED_CLUSTER_TASKS

    if (capacity_orig_of(env->dst_cpu) < capacity_orig_of(env->src_cpu))    //migrating from a cpu with larger orig capacity to a smaller one:
        env->flags |= LBF_IGNORE_BIG_TASKS;                                    //set LBF_IGNORE_BIG_TASKS

redo:
    while (!list_empty(tasks)) {    //walk the cfs tasks
        /*
         * We don't want to steal all, otherwise we may be treated likewise,    //don't steal everything, or the same could be done to us,
         * which could at worst lead to a livelock crash.                         //which at worst ends in a livelock crash
         */
        if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)    //idle or newly idle balance and the src rq is down to 1 task or fewer: stop
            break;

        p = list_last_entry(tasks, struct task_struct, se.group_node);    //take the last task p on the list

        env->loop++;        //count every task examined for detach (not necessarily detached)
        /* We've more or less seen every task there is, call it quits */
        if (env->loop > env->loop_max)        //examined more than the cap allows: stop
            break;

        /* take a breather every nr_migrate tasks */
        if (env->loop > env->loop_break) {                //examined more than sched_nr_migrate_break tasks:
            env->loop_break += sched_nr_migrate_break;    //take a breather,
            env->flags |= LBF_NEED_BREAK;                //setting LBF_NEED_BREAK
            break;
        }

        if (!can_migrate_task(p, env))        //(2-4-1)can task p be migrated to this cpu?
            goto next;

        /*
         * Depending of the number of CPUs and tasks and the    //depending on the cpu/task counts and the cgroup hierarchy,
         * cgroup hierarchy, task_h_load() can return a null    //task_h_load() can return zero;
         * value. Make sure that env->imbalance decreases        //make sure env->imbalance still decreases, otherwise
         * otherwise detach_tasks() will stop only after        //detach_tasks() would only stop once it hits loop_max
         * detaching up to loop_max tasks.
         */
        load = max_t(unsigned long, task_h_load(p), 1);        //clamp the load to at least 1


        if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)    //LB_MIN restricts migration of small tasks: when true, tasks with load below 16 sit out load balance; it defaults to false
            goto next;

        /*
         * p is not running task when we goes until here, so if p is one    //by this point p is not the running task; if p is one of only
         * of the 2 task in src cpu rq and not the running one,                //2 tasks on the src rq and not the running one,
         * that means it is the only task that can be balanced.                //it is the only task that can be balanced at all.
         * So only when there is other tasks can be balanced or                //so only when other tasks could also move (task count > 2),
         * there is situation to ignore big task, it is needed                //or big tasks are to be ignored,
         * to skip the task load bigger than 2*imbalance.                    //do we skip a task whose load exceeds 2x the imbalance.
         *
         * And load based checks are skipped for prefer_spread in        //for prefer_spread, the busiest-group search skips load-based checks,
         * finding busiest group, ignore the task's h_load.                //so the task's h_load is ignored here too
         */
        if (!env->prefer_spread &&                            //1. prefer_spread is not set, and
            ((cpu_rq(env->src_cpu)->nr_running > 2) ||        //2. the src rq runs more than 2 tasks,
            (env->flags & LBF_IGNORE_BIG_TASKS)) &&            //   or the ignore-big-tasks flag is set, and
            ((load / 2) > env->imbalance))                    //3. load > 2x the imbalance
            goto next;                                        //all of 1, 2 and 3 hold: skip task p

        detach_task(p, env);                        //(2-4-2)detach the task per the migration env
        list_add(&p->se.group_node, &env->tasks);    //park the detached task on the env->tasks list for now

        detached++;                    //count detached tasks
        env->imbalance -= load;        //recompute the remaining imbalance after the detach

#ifdef CONFIG_PREEMPTION
        /*
         * NEWIDLE balancing is a source of latency, so preemptible            //newly idle balancing is latency-sensitive; on a preemptible kernel,
         * kernels will stop after the first task is detached to minimize    //stop after the first detached task to keep
         * the critical section.                                            //the critical section short
         */
        if (env->idle == CPU_NEWLY_IDLE)
            break;
#endif

        /*
         * We only want to steal up to the prescribed amount of        //only steal up to the prescribed amount of runnable load;
         * runnable load.                                            //don't over-migrate
         */
        if (env->imbalance <= 0)            //once the imbalance is gone, stop detaching
            break;

        continue;        //imbalance not yet erased and no break condition hit: try the next task
next:
#ifdef CONFIG_SCHED_WALT
        trace_sched_load_balance_skip_tasks(env->src_cpu, env->dst_cpu,
                env->src_grp_type, p->pid, load, task_util(p),
                cpumask_bits(&p->cpus_mask)[0]);
#endif
        list_move(&p->se.group_node, tasks);            //landing here means a skip condition fired; put the examined task p back at the list head (scanning starts from the tail)
    }

    if (env->flags & (LBF_IGNORE_BIG_TASKS |              //if ignore-big-tasks or ignore-preferred-cluster-tasks was set
            LBF_IGNORE_PREFERRED_CLUSTER_TASKS) && !detached) {   //and nothing qualified for detach (detached == 0):
        tasks = &env->src_rq->cfs_tasks;            //reload the src rq's cfs task list,
        env->flags &= ~(LBF_IGNORE_BIG_TASKS |         //drop the two flags,
                LBF_IGNORE_PREFERRED_CLUSTER_TASKS);
        env->loop = orig_loop;                  //reset the loop count, discarding the earlier pass,
        goto redo;                              //and scan again
    }

    /*
     * Right now, this is one of only two places we collect this stat
     * so we can safely collect detach_one_task() stats here rather
     * than inside detach_one_task().
     */
    schedstat_add(env->sd->lb_gained[env->idle], detached);        //schedstats: record the number of detached tasks

    return detached;    //return the number of detached tasks
}

(2-4-1) Decide whether task p can be migrated to this cpu

/*
 * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
 */
static
int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
    int tsk_cache_hot;
    int can_migrate = 1;

    lockdep_assert_held(&env->src_rq->lock);

    trace_android_rvh_can_migrate_task(p, env->dst_cpu, &can_migrate);    //a GKI trace hook; treat it as if it weren't there
    if (!can_migrate)
        return 0;

    /*
     * We do not migrate tasks that are:                        //tasks are not migrated in these cases:
     * 1) throttled_lb_pair, or                                  //1. throttled_lb_pair
     * 2) cannot be migrated to this CPU due to cpus_ptr, or     //2. affinity forbids this cpu
     * 3) running (obviously), or                                //3. the task is currently running
     * 4) are cache-hot on their current CPU.                    //4. the task is cache-hot on its current cpu
     */
    if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))    //is the src or dst rq under cfs bandwidth throttling? if so, don't migrate
        return 0;

    /*
     * don't allow pull boost task to smaller cores.    //never pull a boosted task from a big core to a little one
     */
    if (!can_migrate_boosted_task(p, env->src_cpu, env->dst_cpu))    //the task is boosted, sits in an rtg, and dst's orig capacity < src's
        return 0;

#ifdef CONFIG_SCHED_WALT
    if (p->wts.iowaited && is_min_capacity_cpu(env->dst_cpu) &&        //the task is in iowait, dst is a minimum-capacity core, and src is not
            !is_min_capacity_cpu(env->src_cpu))
        return 0;
#endif

    if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {        //affinity (e.g. cpuset) keeps task p off the dst cpu
        int cpu;

        schedstat_inc(p->se.statistics.nr_failed_migrations_affine);    //schedstats: migrations failed due to affinity

        env->flags |= LBF_SOME_PINNED;        //record "some pinned" in the flags

        /*
         * Remember if this task can be migrated to any other CPU in    //remember whether this task could run on any other cpu of our sg;
         * our sched_group. We may want to revisit it if we couldn't    //we may want to revisit it if pulling other tasks off src_cpu
         * meet load balance goals by pulling other tasks on src_cpu.    //doesn't meet the balance goal.
         *
         * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have        //avoid computing new_dst_cpu for newly idle balance,
         * already computed one in current iteration.                    //or when one was already computed this iteration (seen via LBF_DST_PINNED)
         */
        if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
            return 0;

        /* Prevent to re-select dst_cpu via env's CPUs: */        //avoid re-selecting the same dst cpu from env->cpus
        for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {    //walk the intersection of dst_grpmask and env->cpus
            if (cpumask_test_cpu(cpu, p->cpus_ptr)) {        //does this cpu satisfy task p's affinity?
                env->flags |= LBF_DST_PINNED;                //if so, set LBF_DST_PINNED
                env->new_dst_cpu = cpu;                        //and make it the new_dst_cpu
                break;
            }
        }

        return 0;
    }

    /* Record that we found atleast one task that could run on dst_cpu */
    env->flags &= ~LBF_ALL_PINNED;        //reaching here means at least one task can run on the dst cpu

#ifdef CONFIG_SCHED_WALT
    if (static_branch_unlikely(&sched_energy_present)) {                //with EAS enabled:
        struct root_domain *rd = env->dst_rq->rd;

        if ((rcu_dereference(rd->pd) && !sd_overutilized(env->sd)) &&    //a perf domain exists and env->sd is not overutilized,
            env->idle == CPU_NEWLY_IDLE && !env->prefer_spread &&        //this is newly idle, prefer_spread is not set,
            !task_in_related_thread_group(p)) {                            //and task p is NOT in a related thread group:
            long util_cum_dst, util_cum_src;
            unsigned long demand;

            demand = task_util(p);                                    //task p's util
            util_cum_dst = cpu_util_cum(env->dst_cpu, 0) + demand;    //dst cumulative cpu util after a would-be migration
            util_cum_src = cpu_util_cum(env->src_cpu, 0) - demand;    //src cumulative cpu util after a would-be migration

            if (util_cum_dst > util_cum_src)        //don't migrate if it would leave dst cumulative util above src's
                return 0;
        }
    }

    if (env->flags & LBF_IGNORE_PREFERRED_CLUSTER_TASKS &&        //a cross-cluster migration where task p's preferred cluster
             !preferred_cluster(                                //doesn't match the dst cpu's cluster
                cpu_rq(env->dst_cpu)->wrq.cluster, p))
        return 0;

    /* Don't detach task if it doesn't fit on the destination */
    if (env->flags & LBF_IGNORE_BIG_TASKS &&            //migrating from a larger-orig-capacity cpu to a smaller one,
        !task_fits_max(p, env->dst_cpu))                //and the task doesn't fit the dst cpu's capacity
        return 0;

    /* Don't detach task if it is under active migration */
    if (env->src_rq->wrq.push_task == p)        //task p is already mid active migration
        return 0;
#endif

    if (task_running(env->src_rq, p)) {            //task p is currently running
        schedstat_inc(p->se.statistics.nr_failed_migrations_running);    //schedstats: migrations failed because the task was running
        return 0;
    }

#ifdef CONFIG_SCHED_WALT
    if ((env->idle == CPU_NEWLY_IDLE) &&            //a newly idle balance,
        is_min_capacity_cpu(env->dst_cpu) &&        //dst is a little core,
        !is_min_capacity_cpu(env->src_cpu) &&        //src is a big core,
        walt_get_rtg_status(p)) {                    //and task p is in an rtg with a big-cluster preference:
        bool pull_to_silver_allowed = false;
        unsigned int cpu;

        for_each_cpu(cpu, env->cpus) {            //walk env->cpus
            if (!is_min_capacity_cpu(cpu) &&    //a non-little core
                cpu_overutilized(cpu)) {        //that is overutilized
                pull_to_silver_allowed = true;    //justifies pulling from big to little
                break;
            }
        }

        if (!pull_to_silver_allowed)    //no such core: bail out
            return 0;
    }
#endif

    /*
     * Aggressive migration if:                 //migrate aggressively when any of these hold:
     * 1) IDLE or NEWLY_IDLE balance.           //1. an idle or newly idle balance
     * 2) destination numa is preferred         //2. the destination numa node is preferred (never on this UMA platform)
     * 3) task is cache cold, or                //3. the task is cache-cold (tsk_cache_hot <= 0)
     * 4) too many balance attempts have failed.//4. too many balance attempts have already failed
     */
    tsk_cache_hot = migrate_degrades_locality(p, env);        //always returns -1 on this UMA platform,
    if (tsk_cache_hot == -1)
        tsk_cache_hot = task_hot(p, env);        //so fall back to task_hot()

    if (env->idle != CPU_NOT_IDLE || tsk_cache_hot <= 0 ||            //idle or newly idle, or the task is cold,
        env->sd->nr_balance_failed > env->sd->cache_nice_tries) {    //or failed balances exceed the per-sd threshold:
        if (tsk_cache_hot == 1) {                                        //if the task is actually hot,
            schedstat_inc(env->sd->lb_hot_gained[env->idle]);            //record schedstats: lb_hot_gained[env->idle]
            schedstat_inc(p->se.statistics.nr_forced_migrations);        //and nr_forced_migrations
        }
        return 1;
    }

    schedstat_inc(p->se.statistics.nr_failed_migrations_hot);    //otherwise record the failure: nr_failed_migrations_hot
    return 0;
}

A quick look at how task-hot is decided:

  1. Not a cfs task: not hot
  2. The task runs with the idle policy: not hot
  3. The task is about to be picked on its current cpu (already set as the next or last buddy): hot. Note this is the CFS buddy concept, not the memory-management buddy system.
  4. The migration threshold sysctl_sched_migration_cost is -1: hot
  5. The threshold is 0: not hot
  6. Otherwise: runtime since exec_start below sysctl_sched_migration_cost means hot; anything longer is not
/*
 * Is this task likely cache-hot:
 */
static int task_hot(struct task_struct *p, struct lb_env *env)
{
    s64 delta;

    lockdep_assert_held(&env->src_rq->lock);

    if (p->sched_class != &fair_sched_class)
        return 0;

    if (unlikely(task_has_idle_policy(p)))
        return 0;

    /*
     * Buddy candidates are cache hot:
     */
    if (sched_feat(CACHE_HOT_BUDDY) && env->dst_rq->nr_running &&
            (&p->se == cfs_rq_of(&p->se)->next ||
             &p->se == cfs_rq_of(&p->se)->last))
        return 1;

    if (sysctl_sched_migration_cost == -1)
        return 1;
    if (sysctl_sched_migration_cost == 0)
        return 0;

    delta = rq_clock_task(env->src_rq) - p->se.exec_start;

    return delta < (s64)sysctl_sched_migration_cost;
}
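So the final rule boils down to "a task that ran very recently is still cache hot". A minimal sketch (sysctl_sched_migration_cost defaults to 500000 ns, i.e. 0.5 ms, on mainline kernels; the sample timestamps are made up):

#include <stdio.h>

typedef long long s64;

static s64 sysctl_sched_migration_cost = 500000; /* ns, mainline default */

/* hot when the task last ran within the migration-cost window */
static int is_task_hot(s64 now_ns, s64 exec_start_ns)
{
    if (sysctl_sched_migration_cost == -1)
        return 1;
    if (sysctl_sched_migration_cost == 0)
        return 0;
    return (now_ns - exec_start_ns) < sysctl_sched_migration_cost;
}

int main(void)
{
    s64 now = 10000000;                         /* 10 ms on the rq clock */
    printf("%d\n", is_task_hot(now, 9800000));  /* ran 0.2 ms ago -> 1 (hot) */
    printf("%d\n", is_task_hot(now, 9000000));  /* ran 1 ms ago   -> 0      */
    return 0;
}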

(2-4-2) Detach the task according to the migration env

/*
 * detach_task() -- detach the task for the migration specified in env
 */
static void detach_task(struct task_struct *p, struct lb_env *env)
{
    lockdep_assert_held(&env->src_rq->lock);

    deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);        //(2-4-2-1)dequeue the task
    lockdep_off();
    double_lock_balance(env->src_rq, env->dst_rq);            //lock both rqs together
    if (!(env->src_rq->clock_update_flags & RQCF_UPDATED))
        update_rq_clock(env->src_rq);        //refresh the src rq's clock
    set_task_cpu(p, env->dst_cpu);                            //(2-4-2-2)the migration work itself
    double_unlock_balance(env->src_rq, env->dst_rq);        //unlock both rqs
    lockdep_on();
}

(2-4-2-1) Dequeue the task

void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
    p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;    //on the detach path, on_rq becomes TASK_ON_RQ_MIGRATING (2)

    if (task_contributes_to_load(p))    //if task->state is uninterruptible, the flags don't say frozen, and the state lacks TASK_NOLOAD,
        rq->nr_uninterruptible++;        //bump the rq's blocked-task count by 1 (a task mid-migration still counts as uninterruptible)

#ifdef CONFIG_SCHED_WALT
    if (flags & DEQUEUE_SLEEP)        //this flag isn't set here, so the ed-task cleanup doesn't run
        clear_ed_task(p, rq);
#endif

    dequeue_task(rq, p, flags);        //take the task off the rq
}

 

(2-4-2-2) The migration work itself:

  • Update the WALT load of the src rq, dst rq and task p: subtract p's load from the src rq and add it onto the dst rq
  • Switch task p's cfs rq and parent entity
  • Set p->cpu and p->wake_cpu to the dst cpu
void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
...(some debug code elided here)
    trace_sched_migrate_task(p, new_cpu);

    if (task_cpu(p) != new_cpu) {                    //the genuinely-migrating case
        if (p->sched_class->migrate_task_rq)
            p->sched_class->migrate_task_rq(p, new_cpu);    //a cfs task calls migrate_task_rq_fair
        p->se.nr_migrations++;                        //count task p's migrations
        rseq_migrate(p);                    //set the rseq state RSEQ_EVENT_MIGRATE_BIT and the task flag TIF_NOTIFY_RESUME
        perf_event_task_migrate(p);            //set task->sched_migrated = 1

        fixup_busy_time(p, new_cpu);        //update the WALT load, removing it from the src rq and adding it to the dst rq; fairly involved, not expanded here
    }

    __set_task_cpu(p, new_cpu);                //via set_task_rq, switch the cfs_rq and parent entity, and set p->cpu and wake_cpu to the dst cpu
}

set_task_rq updates the PELT bookkeeping and switches the cfs rq / rt rq and the parent se / parent rt_se:

/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED)
    struct task_group *tg = task_group(p);
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);    //update task p's pelt load against the src rq's timestamp, then stamp p's pelt last-update time with the dst rq's timestamp
    p->se.cfs_rq = tg->cfs_rq[cpu];        //switch task p's cfs rq to the tg's cfs_rq for the dst cpu
    p->se.parent = tg->se[cpu];            //switch task p's parent se to the tg's se for the dst cpu
#endif

#ifdef CONFIG_RT_GROUP_SCHED
    p->rt.rt_rq  = tg->rt_rq[cpu];        //switch task p's rt rq to the tg's rt_rq for the dst cpu
    p->rt.parent = tg->rt_se[cpu];        //switch task p's parent rt se to the tg's rt_se for the dst cpu
#endif
}

(2-5) Attach all detached tasks to their new rq

/*
 * attach_tasks() -- attaches all tasks detached by detach_tasks() to their
 * new rq.
 */
static void attach_tasks(struct lb_env *env)
{
    struct list_head *tasks = &env->tasks;        //take the list of all detached tasks
    struct task_struct *p;
    struct rq_flags rf;

    rq_lock(env->dst_rq, &rf);
    update_rq_clock(env->dst_rq);            //refresh the dst rq's clock

    while (!list_empty(tasks)) {
        p = list_first_entry(tasks, struct task_struct, se.group_node);        //take one task from the head
        list_del_init(&p->se.group_node);        //unlink its group_node

        attach_task(env->dst_rq, p);            //(2-5-1)attach task p onto the dst cpu's rq
    }

    /*
     * The enqueue_task_fair only updates the overutilized status    //enqueue_task_fair only refreshes the overutilized status for waking tasks;
     * for the waking tasks. Since multiple tasks may get migrated  //since several tasks may arrive via the load balancer rather than wakeups,
     * from load balancer, instead of doing it there, update the    //refresh the overutilized status here,
     * overutilized status here at the end.                            //once at the end
     */
    update_overutilized_status(env->dst_rq);    //(2-5-2)refresh the dst rq's overutil status after the migration
    rq_unlock(env->dst_rq, &rf);
}

(2-5-1) Attach task p onto the dst cpu's rq

/*
 * attach_task() -- attach the task detached by detach_task() to its new rq.
 */
static void attach_task(struct rq *rq, struct task_struct *p)
{
    lockdep_assert_held(&rq->lock);

    BUG_ON(task_rq(p) != rq);
    activate_task(rq, p, ENQUEUE_NOCLOCK);    //(2-5-1-1)enqueue task p without refreshing the rq clock
    check_preempt_curr(rq, p, 0);            //ends up in the cfs class's check_preempt_wakeup to decide whether to preempt the dst rq's curr task
}

(2-5-1-1) Enqueue task p without refreshing the rq clock

void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
    if (task_contributes_to_load(p))    //if task->state is uninterruptible, the flags don't say frozen, and the state lacks TASK_NOLOAD,
        rq->nr_uninterruptible--;        //drop the rq's blocked-task count by 1 (the migration leg that counted as uninterruptible is over)

    enqueue_task(rq, p, flags);            //put the task on the rq

    p->on_rq = TASK_ON_RQ_QUEUED;        //set on_rq back to 1
}

(2-5-2) Refresh the dst rq's overutil status after the migration

static inline void update_overutilized_status(struct rq *rq)
{
#ifdef CONFIG_SCHED_WALT
    struct sched_domain *sd;

    rcu_read_lock();
    sd = rcu_dereference(rq->sd);        //take dst rq->sd, i.e. the MC-level sd
    if (sd && !sd_overutilized(sd) &&    //the sd exists, its overutil flag (sd->shared->overutilized) is clear,
        cpu_overutilized(rq->cpu))        //and the dst cpu is overutilized: capacity_orig * 1024 < cpu_util * 1078 (a margin factor of roughly 105%)
        set_sd_overutilized(sd);        //then set the sd overutil flag: sd->shared->overutilized = true
    rcu_read_unlock();
#else
    if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
        WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
        trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
    }
#endif
}
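The overutilization test is the same fixed-point margin pattern seen earlier. A minimal sketch with the 1078 margin quoted above (the util/capacity values are invented):

#include <stdio.h>

/* overutilized when util exceeds roughly 95% of the original capacity:
 * capacity_orig * 1024 < util * 1078 */
static int cpu_overutilized(unsigned long util, unsigned long capacity_orig)
{
    return capacity_orig * 1024 < util * 1078;
}

int main(void)
{
    printf("%d\n", cpu_overutilized(400, 400));  /* at 100% of capacity: 1 */
    printf("%d\n", cpu_overutilized(360, 400));  /* at 90%:              0 */
    return 0;
}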

(2-6) Decide whether active balance is needed

static int need_active_balance(struct lb_env *env)
{
    struct sched_domain *sd = env->sd;

    if (voluntary_active_balance(env))        //(2-6-1)check the conditions for voluntarily triggering active balance
        return 1;

    if ((env->idle != CPU_NOT_IDLE) &&                                //this is an idle or newly idle balance (the dst cpu is idle),
        (capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&    //src cpu_capacity < dst cpu_capacity,
        ((capacity_orig_of(env->src_cpu) <                            //src orig cpu capacity < dst orig cpu capacity,
                capacity_orig_of(env->dst_cpu))) &&
                env->src_rq->cfs.h_nr_running == 1 &&                //the src cpu runs exactly one cfs task,
                cpu_overutilized(env->src_cpu) &&                    //the src cpu is overutilized,
                !cpu_overutilized(env->dst_cpu)) {                    //and the dst cpu is not
        return 1;
    }

    if (env->src_grp_type == group_overloaded && env->src_rq->misfit_task_load)    //the src cpu's group is overloaded and the src rq carries misfit task load
        return 1;

    return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);    //or this sd's balance-failure count exceeds sd->cache_nice_tries + 2
}

 

(2-6-1) Check the conditions for voluntarily triggering active balance

static inline bool
voluntary_active_balance(struct lb_env *env)
{
    struct sched_domain *sd = env->sd;

    if (asym_active_balance(env))    //the platform has no SMT level and therefore never sets SD_ASYM_PACKING, so this is always false
        return 1;

    /*
     * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.            //the dst cpu is idle and the src cpu has just 1 cfs task;
     * It's worth migrating the task if the src_cpu's capacity is reduced    //if the src cpu's capacity is being eaten by another sched class or irqs
     * because of other sched_class or IRQs if more capacity stays            //while more capacity remains available on the dst cpu,
     * available on dst_cpu.                                                //migrating the task is worth it
     */
    if ((env->idle != CPU_NOT_IDLE) &&                //an idle or newly idle balance (the dst cpu is idle),
        (env->src_rq->cfs.h_nr_running == 1)) {        //and the src cpu runs a single cfs task:
        if ((check_cpu_capacity(env->src_rq, sd)) &&    //src cpu_capacity * sd->imbalance_pct < src cpu_capacity_orig * 100,
            (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))    //and src cpu_capacity * sd->imbalance_pct < dst cpu_capacity * 100
            return 1;
    }

    if (env->idle != CPU_NOT_IDLE &&                //an idle or newly idle balance,
            env->src_grp_type == group_misfit_task)    //and the src cpu's group type == group_misfit_task
        return 1;

    return 0;
}
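check_cpu_capacity(), used above and in find_busiest_queue(), asks whether rt/irq pressure has eaten a noticeable share of a cpu's capacity. A minimal sketch using the MC-level imbalance_pct of 117 quoted earlier (inputs are hypothetical):

#include <stdio.h>

/* true when the capacity left for cfs sits noticeably below the
 * original capacity: cpu_capacity * imbalance_pct < capacity_orig * 100 */
static int check_cpu_capacity(unsigned long cpu_capacity,
                              unsigned long capacity_orig,
                              unsigned int imbalance_pct)
{
    return cpu_capacity * imbalance_pct < capacity_orig * 100;
}

int main(void)
{
    /* rt/irq load left only 800 of 1024 available for cfs */
    printf("%d\n", check_cpu_capacity(800, 1024, 117));  /* 1: reduced */
    printf("%d\n", check_cpu_capacity(1000, 1024, 117)); /* 0: fine    */
    return 0;
}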

 

 

Still learning and growing...