[announce] [patch] batch/idle priority scheduling, SCHED_BATCH

From: Ingo Molnar (mingo@elte.hu)
Date: Sun Jun 30 2002 - 19:26:42 EST


the attached patch adds a feature that was pretty high on the scheduler
features wishlist: it implements the functionality of SCHED_IDLE, in a
safe way. Another desired scheduler feature was batch scheduling, the
cache-friendly handling of lowprio, batch-like, CPU-bound, 100%
noninteractive tasks. The new SCHED_BATCH scheduler policy implements both
features.

the existing SCHED_IDLE patches floating around, despite their simplicity,
had one major flaw that prevented their integration into the scheduler: if
an unpriviledged SCHED_IDLE process uses normal kernel functionality,
which happens to grab a critical kernel resource such as the root
directory's semaphore, and schedules away still holding the semaphore,
then there is no guarantee that the task will run again in any
deterministic amount of time - keeping the critical resource potentially
forever - deadlocking every other process that attempts to use that
critical resource. This property, while being a source for soft lockups
even during ordinary use, also makes SCHED_IDLE an easy DoS exploit.

as the size of the patch suggests, the safe solution is not simple. The
basic concept is the identification of user-space preemption via a special
scheduler upcall: one safe point to delay a task's execution indefinitely
is when the task is preempted in pure user-space mode - if this happens
then the lowlevel kernel entry code calls the schedule_userspace()
function, instead of schedule(). In every other case the task needs to
stay in the 'normal' scheduler queues, to guarantee prompt processing of
kernelspace code. Furthermore, such batch-mode tasks need to be scheduled
if they get a signal delivered - otherwise it would not be possible to eg.
kill them.

other properties: SCHED_BATCH also triggers much longer, batch-like
timeslices - the default SCHED_BATCH timeslice is 1.5 seconds. Nice values
still have a meaning for SCHED_BATCH processes as well - they determine
the relative percentage of idle CPU time allocated to SCHED_BATCH
processes. If the SCHED_BATCH process is in kernel-mode then the nice
value is used as the normal priority when preempting (or not preempting)
other, non-SCHED_BATCH processes.

put in another way: whenever a SCHED_BATCH process is in kernel-mode, it's
"elevated" into the SCHED_NORMAL priority domain - which guarantees timely
execution of kernel-space code. When the SCHED_BATCH process is executing
user-space code then it can be put into the batch-queue, and can be
delayed indefinitely.

Timeslice distribution is a modified/simplified version of SCHED_NORMAL
scheduling: SCHED_BATCH processes are scheduled in a roundrobin way,
timeslices are distributed based on the nice value. SCHED_BATCH tasks that
use up their timeslices get suspended until all other SCHED_BATCH tasks on
that CPU exhaust their timeslices - at which point a new turn begins.
SCHED_NORMAL, SCHED_RR and SCHED_FIFO tasks preempt SCHED_BATCH processes
immediately. All this functionality is implemented in an O(1) way. (The
interactivity estimator is active for SCHED_BATCH processes as well - this
has an effect if the task is in kernelspace mode. This also makes sure
that no artificial priority boost can be achieved by switching in/out of
SCHED_BATCH mode.)

on SMP there are per-CPU batch queues - which enables the use of hundreds
or thousands of SCHED_BATCH processes, if desired. A new, independent
load-balancer is used to distribute SCHED_BATCH processes: SCHED_BATCH
processes will populate CPUs depending on the CPU's "10 seconds history of
idleness". The more idle a CPU, the more SCHED_BATCH processes it will
handle. The weighting is done in a way to make the global distribution of
SCHED_BATCH timeslices fair. The load-balancer also honors caching
properties and tries to reduce unnecessery bouncing of SCHED_BATCH
processes. (The balancing, like in the SCHED_NORMAL case, is not intended
to be 100% 'sharp' - some statistical fuzziness is allowed to keep
overhead and complexity down.)

(to see the SMP SCHED_BATCH load-balancer in action, start up multiple
SCHED_BATCH processes on an SMP box - they populate all available CPUs
evenly. Then start up a single CPU-intensive, non-SCHED_BATCH process -
after a few seconds all SCHED_BATCH processes will migrate off to the
remaining CPUs, and the SCHED_NORMAL task will get 100% CPU time of a
single CPU.)

(design sidenote: initially i tried to integrate SCHED_BATCH scheduling
into the existing scheduler and SCHED_NORMAL balancer somehow, but gave up
on this idea. While that worked for RT scheduling, SCHED_BATCH scheduling
is quite different, and is 100% orthogonal to all the other scheduling
domains. Eg. the balancing of non-SCHED_BATCH processes *must not* be
influenced by the way SCHED_BATCH processes are distributed amongst CPUs.
The distribution of timeslices must be completely separated as well. So
since all the queues and state has to be separate, they can as well be in
separate (and simplified) data structures.)

i've also attached setbatch.c, which is a simple utility to change a given
PID's scheduling policy to SCHED_BATCH. One straightforward way of using
it is to change one shell to be SCHED_BATCH:

        ./setbatch $$

and start various commands from this SCHED_BATCH shell - all forked
children inherit the SCHED_BATCH setting.

the load generated by multiple SCHED_BATCH processes does not show up in
the load average - this is the straightforward solution to not confuse
load-average-sensitive applications such as sendmail.

the runtime performance impact of SCHED_BATCH is fairly minimal. There is
a (pretty light) branch and function call cost in the entry.S preemption
codepath. Otherwise the SCHED_BATCH code triggers in slowpaths only: eg.
when we would otherwise switch to the idle thread.

the patch was tested on x86 systems. non-x86 systems should still work
with the patch applied, but no SCHED_BATCH process will in fact be
suspended. For batch-suspension to work the architecture needs to call
schedule_userspace() instead of schedule(), when pure userspace code is
preempted.

the attached patch is against 2.5.24, it was tested on SMP and UP systems
as well, but keep in mind that this is the first version of this patch, so
some rough edges might be present. The patch can also be downloaded from
my scheduler patches homepage:

        http://redhat.com/~mingo/O(1)-scheduler/batch-sched-2.5.24-A0

bug reports, success reports, comments, suggestions are welcome,

        Ingo





-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sun Jun 30 2002 - 22:00:15 EST