Re: a different approach to scheduling issues

John Regehr
Wed, 30 Sep 1998 12:00:08 -0400 (EDT)

> What it would take, I think, is a careful analysis of the scheduler
> "interface", that is, all the functions that allow threads to be
> queued and dequeued, as well as schedule() itself. Once identified in
> this way (and if necessary, some cleanup done to create such a defined
> interface, but I'm not sure this is needed), it should be relatively
> simple to allow the module loading process to load a new version of
> the scheduler.

Interesting! I'm thinking about taking on something like what you
describe for my PhD work - I'll include a very rough draft of my proposal
here. This is relevant to this list since Linux is one of the platforms
I'm interested in working on.

I did scheduling work at Microsoft Research last summer, and my ideas
came out of that. No, I'm not some random MS drone - I'm a huge Linux
fan, and have been since 0.96c :).



The proposed work is to incorporate more flexible CPU scheduling into
existing operating systems, using dynamically loadable scheduler
modules. We plan to implement a kernel scheduler interface that
allows hierarchical stacking of loaded schedulers, and also to
automate analysis of the interactions between these schedulers, so
that we can make guarantees about the entire system based on what we
know about the behavior of individual schedulers.

There are many applications that can either perform poorly or fail to
work at all when scheduled by the default throughput-optimized
timesharing scheduler found in standard operating systems. Some
domain-specific schedulers that we would like to load on demand are:
various kinds of real-time schedulers, cluster coschedulers, SMP
coschedulers, idle-time batch schedulers, and alternate time-sharing
schedulers.

To motivate hierarchies of schedulers, we present two examples:

1. The Rialto/NT scheduler is a prototype real-time scheduler for
Windows NT. In a system with loadable scheduler modules, the
real-time scheduler would be at the root of the scheduling
hierarchy. It schedules activities, which are collections of threads
that have reserved CPU time. When an activity is scheduled, a
round-robin scheduler selects the actual thread to run. When no
activity needs to be scheduled, the scheduler donates its time to the
default NT timesharing scheduler. When no NT thread is runnable, the
NT scheduler donates its time to a batch scheduler which gives very
long timeslices to background jobs in order to reduce inefficiency
introduced by extra context switches.

                    Rialto/NT
                /       |       \
      activity 1    activity 2      default
     RR scheduler  RR scheduler  NT scheduler
      /   |   \       /   \        /  |  \
     t1  t2  t3     t4    t5     t6  t7  batch
                                          /  \
                                        t8    t9
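To make the idea of time donation down a hierarchy concrete, here is a
minimal sketch in C of how such a tree might be walked. Everything here
is hypothetical illustration, not Rialto/NT code: each internal node is
a scheduler that round-robins over its children, and a scheduler with
nothing runnable "donates" its turn by returning nothing, letting its
parent fall through to the next sibling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node in a scheduler hierarchy: an internal node is a
 * scheduler choosing among children; a leaf is a thread. */
struct sched_node {
    const char *name;
    struct sched_node *children[8];
    int nchildren;
    int runnable;   /* leaf only: is this thread runnable? */
    int next;       /* round-robin cursor over children */
};

/* Walk the hierarchy: ask each scheduler, round-robin, for a runnable
 * leaf; a scheduler whose subtree has no runnable thread effectively
 * donates its time by returning NULL to its parent. */
static struct sched_node *pick(struct sched_node *n)
{
    int i;
    if (n->nchildren == 0)
        return n->runnable ? n : NULL;
    for (i = 0; i < n->nchildren; i++) {
        int c = (n->next + i) % n->nchildren;
        struct sched_node *t = pick(n->children[c]);
        if (t) {
            n->next = (c + 1) % n->nchildren;
            return t;
        }
    }
    return NULL;
}
```

In the Rialto/NT picture above, the real-time root would not be plain
round-robin, of course; the point is only that scheduling becomes a
recursive walk in which each level applies its own policy.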

2. Parallel programs can perform poorly in the presence of
multiprogramming if there is no coordination between schedulers on
machines in the cluster. This problem is exacerbated by user-level
network protocols which often poll for messages rather than blocking
in the kernel; the kernel scheduler is therefore unaware if the
parallel program is getting work done, or if it is spinning while it
waits for a message from a thread that is currently descheduled on
another machine. In the following scenario, a cluster coscheduler
coordinates jobs between machines, and application specific schedulers
(fifo or round robin) schedule threads within each application.

               cluster coscheduler
               /        |        \
        parallel     parallel      default
         job 1        job 2     NT scheduler
         /   \       /  |  \          |
        t1   t2     t3  t4  t5       t6

Schedulers conforming to the proposed interface will be
source-compatible between operating systems. The goal of this is the
same as the goal of the Uniform Driver Interface, which allows device
drivers to be binary-compatible between Unixes running on x86
machines. That is, we would like as many system components as
possible, including schedulers, to be written once and used many
times, rather than being reimplemented on every platform.

We picture the interface for a portable scheduler as containing the
following elements:

- schedulers need to be able to receive arbitrary messages from
  threads in the system and from other schedulers;
- they need to be notified of available processors, preempted
  processors, and blocking and unblocking threads;
- they need to be able to donate time either to threads or to other
  schedulers; and
- they need various support functions such as synchronization, memory
  allocation, loading and unloading, and timers.

Every thread in the system will belong to at least one scheduler.
Messages from threads to schedulers will probably be implemented by
adding a system call to the OS; messages between schedulers are
lightweight, and will be implemented as function calls or events
placed in shared-memory queues. If the interface we have proposed is
the entire API that a scheduler needs, then it follows that scheduler
code can be portable between operating systems.
We want to implement our schedulers in the kernel because we believe
this will result in a more efficient system. From the interface above
we see that there are many kinds of events that schedulers need to
know about; if every such notification results in interprocess
communication, system efficiency will suffer because of the extra
context switches.

However, there is certainly no reason that our scheduler interface
cannot be used for user-space schedulers as well as in-kernel
schedulers. Supporting this would simply require a second, IPC-based
implementation of the scheduler interface. User-level schedulers
should be easier to debug than in-kernel schedulers, and if they
reside in the same address space as the threads that they are
scheduling, then it is possible that they could reduce the number of
context switches rather than increasing them. We plan to investigate
the question of how to put loadable schedulers in the right place.

The existence of a hierarchy of schedulers motivates the second part
of the proposed work, which is to provide a framework for deriving the
properties of a group of schedulers from what we know about the
individual schedulers. We want the system to have a model of the
available CPU resources, and then for loadable schedulers to be
accompanied by a description of themselves in some sort of system
description language. Before a scheduler is added to the hierarchy,
the system will verify that it is possible for the scheduler to
fulfill its requirements. For example, if a real-time scheduler is
loaded in such a way that it will be running in the idle-time of a
time-sharing scheduler, then the system should probably not allow the
scheduler to be loaded because it will be unable to provide meaningful
guarantees to any threads that it schedules. Another easy example is
the scheduler at the root of the hierarchy - because it receives the
undivided attention of the CPU, if it works at all it will be able to
guarantee resources to threads that it schedules.
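As a toy version of the kind of check the system might perform,
suppose each scheduler's self-description declares whether it promises
guarantees, what CPU share it needs, and what share it can in turn
guarantee to children. An admission test is then just arithmetic on
these declarations. All names and the percentage model here are
hypothetical, purely to make the idea concrete:

```c
#include <assert.h>

/* Hypothetical self-description accompanying a loadable scheduler. */
struct sched_desc {
    int needs_guarantee;   /* does it promise real-time behavior? */
    int pct_needed;        /* CPU share it must be guaranteed */
    int pct_available;     /* share it can still guarantee to children */
};

/* Reject a guarantee-making child whose parent cannot back it; a
 * best-effort child (e.g. a time-sharing scheduler) always fits.
 * An idle-time parent would carry pct_available == 0, so any
 * real-time child loaded under it is refused, as in the text. */
static int admit(struct sched_desc *parent, const struct sched_desc *child)
{
    if (child->needs_guarantee) {
        if (parent->pct_available < child->pct_needed)
            return 0;
        parent->pct_available -= child->pct_needed;
    }
    return 1;
}
```

The root scheduler starts with the undivided CPU (pct_available of
100), which is exactly the "easy example" above: whatever it promises,
it can keep.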

Some open questions are: what model should we use for the underlying
resources, what model should we use for the schedulers, do we want a
scheduler hierarchy for individual CPUs or is there a single hierarchy
for the entire system, and are there any OS-dependent structures that
the OS-independent schedulers need to be aware of? For example, Linux
has process groups, NT has job objects, and Rialto has activities -
these are all entities containing multiple threads that the scheduler
should be aware of. Also, what is the language that schedulers use to
tell the system about their requirements? Maybe they should reserve a
percentage of the CPU? If so, what do we do about schedulers such as
cluster coschedulers that have unpredictable timing requirements?

Rialto/NT was one of the original motivations for this work. Although
it is a fairly complex piece of code, its interface to NT is thin. It
is currently not a loadable scheduler - it is implemented as a set of
patches to NT 5. This will be one of the first schedulers to be
modified to use the portable interface; after this is done, we want to
implement the interface in a unix kernel, Linux most likely, to show
that schedulers can indeed be portable.

This work falls under the umbrella of extensible operating systems.
Although security is generally a concern in this area, we currently
treat loadable schedulers as trusted modules, just like the rest of
the system software.

So, the overall system structure looks like this:

   unloaded                        new job
   scheduler                          |
       |                              |
       v                              v
   scheduler <-----------------> admission
   analysis           ^           control
       |              |              |
       |           feedback          |
       v              ^              v
   loaded             |           running
   schedulers >---------------< jobs

John Regehr
grad student | Department of Computer Science | University of Virginia
