If there is only one task remaining on the rb-tree and it has a very high nice value, min_vruntime becomes the vruntime of that high-nice task (see the dequeue function). If a new task is inserted into the tree a while later, it starts with a very large vruntime, which is wrong.

So every time the current task updates its own vruntime, cfs_rq->min_vruntime is updated as well; see update_min_vruntime.

But consider the if(se->vruntime == cfs_rq->vruntime) check in that function. If the only task left in the rb-tree has a high nice value and ran a while ago, its vruntime is large, cfs_rq->min_vruntime is set to that large value, and the result is the same as before.

Why does this happen, and why is it bad? The cause is the current task. If min_vruntime is far ahead of curr->vruntime, curr gets more physical run time than it should.

Everything seems fine at first glance; the details follow:

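Starting with 2.6.25, the CFS scheduler left its original form behind and moved toward maturity. The scheduler no longer tracks how long each process has waited on the run queue; instead it gives each process a vruntime, a per-process virtual clock that is more explicit than in earlier versions. Every process's virtual clock moves forward, only at different rates, and the scheduler picks the slowest one to run. That is all well and good, but the 2.6.25 implementation falls short, because the way it advances the cfs_rq's min_vruntime easily lets the vruntime of two processes drift too far apart. Before discussing the problem, note first that the cfs_rq's min_vruntime is monotonically increasing; it never goes backward.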
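The different clock rates mentioned above can be sketched in C. This is a simplified illustration, not kernel code: the helper name vruntime_delta is hypothetical, the kernel itself uses precomputed inverse weights instead of a plain division, and the weights used in the note below (1024 for nice 0, 15 for nice 19) are the values from the kernel's nice-to-weight table.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024ULL /* load weight of a nice-0 task */

/*
 * Hypothetical helper: virtual time charged for delta_exec_ns nanoseconds
 * of real run time. A lighter weight (higher nice value) makes the virtual
 * clock advance faster.
 */
static uint64_t vruntime_delta(uint64_t delta_exec_ns, uint64_t weight)
{
    assert(weight > 0); /* a runnable task always has a non-zero weight */
    return delta_exec_ns * NICE_0_LOAD / weight;
}
```

With these weights, 15000 ns of real run time costs a nice-19 task 1024000 ns of virtual time, while a nice-0 task is charged only 15000 ns. This is why a high-nice task that has run for a while sits far to the right in the tree.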

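In the 2.6.25 code, the cfs_rq's min_vruntime is updated only when a process is enqueued or dequeued. If nothing else, the vruntime of the currently running process is ignored: CFS takes the running process off the red-black tree while it runs, so its vruntime is not considered, however small it is. Suppose the CPU's red-black tree run queue holds only one process, and that process has a very large nice value, so it very likely has a very large vruntime. The 2.6.25 logic then makes the cfs_rq's min_vruntime equal to the vruntime of that high-nice process. If a process is created at this point and inserted into the tree, the place_entity logic gives the new process a very large vruntime. A very large vruntime is a penalty. Should a new process be penalized? There is no reason for it, so this policy is certainly wrong.

Later versions therefore improved on this by updating the cfs_rq's min_vruntime in update_curr for the current process, which is where the update_min_vruntime function comes from. But consider the if(se->vruntime == cfs_rq->vruntime) logic in it. If a single process remains in the red-black tree with a very large nice value, and thus very likely a very large vruntime, it must have run not long ago, since that is how its virtual clock got pushed so far forward. By this if check, that very large vruntime is assigned to the cfs_rq's min_vruntime. Because min_vruntime is monotonically increasing, the result is the same as in 2.6.25: unfairness toward new and newly woken processes.

What is the essence of all this, and why does it turn out so badly? The cause is that the current process affects everything. If the cfs_rq's min_vruntime is far ahead of the current process's vruntime, the current process gets more physical time than it should, and that is unfair. Should the current process run for a long time just because a process with a very large vruntime sits in the tree? And if that process deserves a penalty for its large vruntime, why penalize newly enqueued processes along with it? It should not be so. So the if(se->vruntime == cfs_rq->vruntime) check was removed, and the function became the following: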

static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
    u64 vruntime = cfs_rq->min_vruntime;

    if (cfs_rq->curr)
        vruntime = cfs_rq->curr->vruntime;

    if (cfs_rq->rb_leftmost) {
        struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
                                           struct sched_entity,
                                           run_node);

        vruntime = min_vruntime(vruntime, se->vruntime);
    }

    cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
}
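To see how this version behaves while a task is running, here is a minimal user-space sketch. It is not kernel code: struct rq_model and update_min_vruntime_model are hypothetical stand-ins that replace the red-black tree with a pointer to the smallest queued vruntime, and the two comparison helpers are written from scratch to tolerate counter wrap-around.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cfs_rq; a pointer is NULL when there is
 * no running entity or the tree is empty. */
struct rq_model {
    uint64_t min_vruntime;
    const uint64_t *curr;     /* vruntime of the running entity */
    const uint64_t *leftmost; /* smallest vruntime queued in the tree */
};

/* Wrap-safe comparisons: the signed difference decides which value is later. */
static uint64_t max_vruntime(uint64_t a, uint64_t b)
{
    return (int64_t)(b - a) > 0 ? b : a;
}

static uint64_t min_vruntime(uint64_t a, uint64_t b)
{
    return (int64_t)(b - a) < 0 ? b : a;
}

/* Same decision logic as the function above, on the simplified model. */
static void update_min_vruntime_model(struct rq_model *rq)
{
    uint64_t vruntime = rq->min_vruntime;

    if (rq->curr)
        vruntime = *rq->curr;
    if (rq->leftmost)
        vruntime = min_vruntime(vruntime, *rq->leftmost);

    /* min_vruntime may only move forward. */
    rq->min_vruntime = max_vruntime(rq->min_vruntime, vruntime);
}
```

With curr at 1000 and a single queued task at 900000, min_vruntime follows curr (1000, then 2000 as curr advances) instead of jumping to 900000, so a task enqueued at that moment is placed near the running task.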

But consider the case where the current task has been dequeued from the tree. Now there is no need to worry about the current task getting more time; if curr wants to requeue into the tree, its vruntime will be large too. If cfs_rq->min_vruntime is still computed as max_vruntime(cfs_rq->min_vruntime, vruntime) as above, it does not advance, and that is unfair to the tasks with large vruntime still in the tree. Why? A new task inserted into the tree now should compete with the tasks already in the tree. So keep the distance between curr and the rightmost task small.

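But consider the case where the current process has been dequeued, because it went to sleep or for some other reason. The concern discussed above no longer applies, because that concern was about the current process getting too much run time, and only about the current process: if some process in the tree has a vruntime not far from the current process's, the cfs_rq's min_vruntime will not pick the process with the very large vruntime. Since the current process is no longer a concern, follow the code above: if (cfs_rq->curr) fails, so the subsequent min_vruntime(vruntime, se->vruntime) picks vruntime, that is, cfs_rq->min_vruntime. In this case the cfs_rq's min_vruntime does not advance. The outcome looks like the opposite of what was discussed above, but the effect is the same. If a new process is inserted into the red-black tree now, its vruntime is based on this very small min_vruntime. Before, the new process got a very large vruntime, which was unfair to it; now it gets a very small vruntime, which is unfair to the processes on the right side of the tree. A new process should compete with, or blend in among, the processes already in the tree rather than lean on a current process that is on its way out. So in this case the cfs_rq's min_vruntime should advance to the vruntime of the process with the very large vruntime.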

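In short, to stay fair, do everything possible to keep the vruntime of any two processes from drifting too far apart; once they drift too far apart, it is unfair to the process in front. If cfs_rq->min_vruntime is not updated promptly but simply taken from the tree's leftmost, that leftmost easily pulls away from curr, so a new or newly woken process loses out on insertion and the current process ends up running longer. Hence the update_min_vruntime function, called on every update_curr, which raises the frequency and precision of min_vruntime updates. Conversely, if min_vruntime is not advanced when the current process is dequeued, a new process is inserted into the red-black tree with a very low vruntime, which likewise starves the processes on the right side of the tree. Hence the following version of update_min_vruntime: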

static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
    u64 vruntime = cfs_rq->min_vruntime;

    if (cfs_rq->curr)
        vruntime = cfs_rq->curr->vruntime;

    if (cfs_rq->rb_leftmost) {
        struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
                                           struct sched_entity,
                                           run_node);

        if (!cfs_rq->curr)
            vruntime = se->vruntime;
        else
            vruntime = min_vruntime(vruntime, se->vruntime);
    }

    cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
}
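The effect of the added !cfs_rq->curr branch can be checked with a small user-space comparison. This is only a sketch under simplifying assumptions: struct rq_model, update_without_curr_check and update_with_curr_check are hypothetical names, and the red-black tree is reduced to a pointer to its smallest vruntime.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cfs_rq; NULL means "no such entity". */
struct rq_model {
    uint64_t min_vruntime;
    const uint64_t *curr;     /* vruntime of the running entity */
    const uint64_t *leftmost; /* smallest vruntime queued in the tree */
};

static uint64_t max_vruntime(uint64_t a, uint64_t b)
{
    return (int64_t)(b - a) > 0 ? b : a;
}

static uint64_t min_vruntime(uint64_t a, uint64_t b)
{
    return (int64_t)(b - a) < 0 ? b : a;
}

/* Earlier version: always takes the minimum with the leftmost entity. */
static void update_without_curr_check(struct rq_model *rq)
{
    uint64_t vruntime = rq->min_vruntime;

    if (rq->curr)
        vruntime = *rq->curr;
    if (rq->leftmost)
        vruntime = min_vruntime(vruntime, *rq->leftmost);
    rq->min_vruntime = max_vruntime(rq->min_vruntime, vruntime);
}

/* Later version: with no running entity, follow the leftmost directly. */
static void update_with_curr_check(struct rq_model *rq)
{
    uint64_t vruntime = rq->min_vruntime;

    if (rq->curr)
        vruntime = *rq->curr;
    if (rq->leftmost) {
        if (!rq->curr)
            vruntime = *rq->leftmost;
        else
            vruntime = min_vruntime(vruntime, *rq->leftmost);
    }
    rq->min_vruntime = max_vruntime(rq->min_vruntime, vruntime);
}
```

Starting from min_vruntime = 100 with no current task and one queued task at 900000, the earlier version leaves min_vruntime at 100, while the later version advances it to 900000, so a newly inserted task starts alongside the task already in the tree.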

To close, here is a quote from an expert on the mailing list:

I used to test CFS on my 1.2GHz laptop with 512M and a make -j5. That would result in a slow but steady system. With steady I mean the latency was pretty constant. The O(1) scheduler would utterly mess this up, it would be fast, until the desktop bloat took enough cpu time and then it would starve a while, etc..