Re: [announce] [patch] ultra-scalable O(1) SMP and UP scheduler

From: Davide Libenzi (
Date: Sat Jan 05 2002 - 21:16:21 EST

On Sun, 6 Jan 2002, Ingo Molnar wrote:

> On Sat, 5 Jan 2002, Davide Libenzi wrote:
> > Can you send me a link, there're different things to be fixed IMHO.
> my latest stuff is at:
> > The load estimator can easily use the current dyn_prio/time_slice by
> > simplyfing things a _lot_
> i have experimented with a very high number of variants. I estimated sleep
> times, i estimated run times, i estimated runqueue times. Note that the
> current estimator measures time spent on the *runqueue*, not time spent on
> the CPU. This means that in an overload spike we have an automatically
> increasing penalization of tasks that want to run. While i'm not too
> emotional about eg. the RT bits, this part of the scheduler is pretty
> critical to handle high load smoothly.
> the integration effect of the estimator was written to be fast, and it's
> fast. Also note that in most of the time we do not even call the
> estimator:
> if (p->run_timestamp == jiffies)
> goto enqueue;
> ie. in high frequency wakeup situations we'll call into the estimator only
> once every jiffy.

Like the current one ( pre8 )
You've a per-cpu swap array counter ( old recalc loop ) that you increment
each time you swap arrays.
Each task struct has its rcl_last that is updated when you inject the task
on the run queue :

        p->dyn_prio += rcl_curr(task_qid) - p->rcl_last;
        p->rcl_last = rcl_curr(task_qid);
        if (p->dyn_prio > MAX_DYNPRIO) p->dyn_prio = MAX_DYNPRIO;

Something like this, and you'll push the task on the given queue depending
on 'prio' ( _unsigned_ with the code above ).
Each time a task exaust its time slice you decrease prio.
It works great and it's way simpler.

> > I would suggest a lower number of queues, 32 is way more than necessary.
> the reason i used more queues is the 'penalizing' effect of the per-task
> load-average estimator. We want to have some priority room these CPU-bound
> tasks can escape into, without hurting some of the interactive jobs that
> might get a few penalties here and there but still dont reach the maximum
> where all the CPU hogs live. (this is p->prio == 63 right now.)
> also, i wanted to map all the 39 nice values straight into the priority
> space, just to be sure. Some people *might* rely on finegrained priorities
> still.
> there is one additional thing i wanted to do to reduce the effect of the
> 64 queues: instead of using a straight doubly-linked list a'la list_t, we
> can do a head-pointer that cuts the queue size into half, and reduces
> cache footprint of the scheduler data structures as well. But i did not
> want to add this until all bugs are fixed, this is an invariant
> cache-footprint optimization.

Ingo, i did a lot of testing by studying the dyn_prio distribution.
You've a lot of tasks ( i/o bound ) moving between the very firsts ( top )

> > The rt code _must_ be better, it can be easily done by a smartest
> > wakeup. There's no need to acquire the whole lock set, at least w/out
> > a checkpoint solution ( look at BMQS ) that prevents multiple failing
> > lookups inside the RT queue.
> regarding SCHED_OTHER, i have intentionally avoided smart wakeups, pushed
> the balancing logic more into the load balancer.
> load spikes and big statistical fluctuations of runqueue lengths we should
> not care much about - they are spikes we cannot flatten anyway, they can
> be gone before the task has finished flushing over its data set to the
> other CPU.
> regarding RT tasks, i did not want to add something that i know is broken,
> even if rt_lock() is arguably heavyweight. I've seen the 'IPI in flight
> misses the real picture' situation a number of times and if we want to do
> RT scheduling seriously and accurately on SMP then we should give a
> perfect solution to it. Would you like me to explain the 'IPI in flight'
> problem in detail, or do you agree that it's a problem?

What's the 'IPI in flight' a new counter terrorism measure ? :)
I use a chack point inside BMQS.
There's a global variable that is incremented each time an RT task is woke
There's a local task struct checkpoint that is aligned to the global one
when a lookup inside the RT queue results empty :

    if (grt_chkp != rtt_chkp(cpu_number_map(this_cpu)) &&
        goto rt_queue_select;

We've to work on the rt code and the balancing code/hooks

- Davide

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at
Please read the FAQ at

This archive was generated by hypermail 2b29 : Mon Jan 07 2002 - 21:00:30 EST