Re: RFC for a new Scheduling policy/class in the Linux-kernel

From: James H. Anderson
Date: Tue Jul 14 2009 - 12:27:09 EST



Hi all,

I've been reading these email messages for a few days
now and wanted to make some comments... maybe what I
have to say will be useful to you... maybe not.

First, I thought I should introduce myself because I
think some of you may not know me. I lead the LITMUS^RT
project at UNC (http://www.cs.unc.edu/~anderson/litmus-rt/).
I think it is important to mention this because the goals
of our LITMUS^RT-related research mean that I'm coming at
all of this from a rather different place than (I think)
most of you. The main goal of our research has been to
identify those real-time scheduling and synchronization
algorithms that work best on multicore platforms. We
define "work best" in terms *schedulability* with *real
overheads* from *real implementations* considered. There
are two important aspects of this that will influence my
comments (so take them with a grain of salt):

1. By emphasizing schedulability, we are clearly defining
"real-time" to mean predictable, and not real-fast.

2. We use Linux as a platform for implementing the algorithms
we test and getting overheads. It has never been among our
goals to actually change mainline Linux (although this may become
a goal in the future). This means that we haven't focused too
much on legacy-related hindrances. I'm using the word
"hindrances" with intention. I think in particular anything that
is POSIX-based is going to be fraught with problems in a multicore
world. Such legacy issues seem to be at the heart of many of
the problematic scenarios you are discussing. Anyway, maybe
these things are a fact of life.

The first email in this thread that I was cc'ed on talked about
implementing global EDF, but the discussion since has been
entirely about synchronization issues. To the best of my
knowledge, the only published real-time locking protocol that
can be used under global EDF (and many other algorithms) without
restricting critical sections (like not allowing certain resources
to be accessed in a nested manner) is a protocol we designed called
the flexible multiprocessor locking protocol (FMLP). Please see
the following papers at http://www.cs.unc.edu/~anderson/papers.html:

B. Brandenburg and J. Anderson, "Reader-Writer Synchronization for
Shared-Memory Multiprocessor Real-Time Systems", Proceedings of the
21st Euromicro Conference on Real-Time Systems, pp. 184-193, July 2009.

B. Brandenburg and J. Anderson, "A Comparison of the M-PCP, D-PCP, and
FMLP on LITMUS^RT", Proceedings of the 12th International Conference
on Principles of Distributed Systems, pp. 105-124, December 2008.

B. Brandenburg and J. Anderson, "An Implementation of the PCP, SRP,
D-PCP, M-PCP, and FMLP Real-Time Synchronization Protocols in LITMUS^RT",
Proceedings of the 14th IEEE International Conference on Embedded and
Real-Time Computing Systems and Applications, pp. 185-194, August 2008.

B. Brandenburg, J. Calandrino, A. Block, H. Leontyev, and J. Anderson,
"Real-Time Synchronization on Multiprocessors: To Block or Not to
Block, to Suspend or Spin?", Proceedings of the 14th IEEE Real-Time
and Embedded Technology and Applications Symposium, pp. 342-353, April 2008.

A. Block, H. Leontyev, B. Brandenburg, and J. Anderson, "A Flexible
Real-Time Locking Protocol for Multiprocessors", Proceedings of the
13th IEEE International Conference on Embedded and Real-Time Computing
Systems and Applications, pp. 47-57, August 2007.

(BTW, most papers on this page with "Bjoern Brandenburg" as a co-author
are LITMUS^RT-related. I've added Bjoern to the cc list.)

The underlying philosophy of our synchronization-related work is
"simplicity wins". Simplicity is obviously a good thing from an
implementation perspective, but it is also good from an *analysis*
perspective. Correctly computing task blocking terms for multiprocessor
real-time locking protocols is *really* hard. There are many
mistakes in analysis presented in the literature (in fact, I think a
new one was pointed out at ECRTS two weeks ago). It is really hard
to nail down all the corner cases, and simple mechanisms have far
fewer corner cases. In our work, we first carefully write up all
blocking analysis and then have a multi-hour group meeting where
6 to 8 people review and discuss the analysis.

The FMLP is really embarrassingly simple. Its main mechanisms are:

1. Require critical sections to execute non-preemptively (I know,
non-preemption violates the "real-fast" religion -- I've been in
these discussions before :-)).

2. Execute critical sections in FIFO order (I know, FIFO violates
the "real-time" religion).

3. Deal with nested lock accesses by requiring a higher-level,
coarser lock to be acquired. While this may seem "too dumb", traces
we have taken of real systems (including Linux) indicate that
two-level nesting is not that common and deep nesting is quite rare.
Why design complicated mechanisms that are both hard to implement
and analyze to deal with something that is not so common?

In the FMLP, waiting can be implemented via either spinning or
suspension (technically, critical sections are categorized as
"short" or "long" and spinning is used for short CSs and suspension
for long ones -- this categorization is entirely up to the user).
Spinning is done non-preemptively (suspending obviously is not).
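
Just to make mechanisms 1 and 2 concrete, here is a rough sketch (my own,
in C11 atomics, not the actual LITMUS^RT code) of the short-resource path:
a ticket lock serves waiters in FIFO order, and preemption stays off for
the spin and the critical section. The preempt_off()/preempt_on() stubs
stand in for whatever the host kernel provides (preempt_disable()/
preempt_enable() in Linux); for nested accesses (mechanism 3), one such
lock would simply protect the whole resource group.

/* Sketch of FMLP-style short-resource locking: FIFO ordering via a
 * ticket lock, executed non-preemptively. Zero-initialize both fields
 * before use. */

#include <stdatomic.h>

struct fifo_lock {
    atomic_uint next_ticket;   /* next ticket to hand out           */
    atomic_uint now_serving;   /* ticket currently holding the lock */
};

static void preempt_off(void) { /* placeholder: preempt_disable() */ }
static void preempt_on(void)  { /* placeholder: preempt_enable()  */ }

static void fifo_lock_acquire(struct fifo_lock *l)
{
    preempt_off();                                /* mechanism 1: non-preemptive */
    unsigned int me = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != me)
        ;                                         /* mechanism 2: spin in FIFO order */
}

static void fifo_lock_release(struct fifo_lock *l)
{
    atomic_fetch_add(&l->now_serving, 1);         /* hand off to the next waiter */
    preempt_on();
}

Nothing here needs to know task priorities, which is part of why the
analysis stays simple.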

The beauty of non-preemptive FIFO is that on an m-processor system,
blocking behavior is easy to analyze. For example, with spin-based
waiting, the blocking per critical-section access is (m-1) times the
cost of one critical section. Someone remarked in an earlier email
that we're really thinking of m being somewhat small (4, maybe 8).
As such, this is a blocking term that is really hard to beat. As I
said, simplicity wins. With suspension-based waiting, things are a
bit more complicated, but not much.
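
To put some made-up numbers on the spin-based bound (my notation, not
taken from the papers): if L_max is the longest critical-section length
among the tasks sharing a resource, then under non-preemptive FIFO
spinning a request waits behind at most one request from each of the
other m - 1 processors, so

    blocking per request <= (m - 1) * L_max

and with m = 4 and L_max = 10 microseconds, that is at most 30
microseconds of spinning per request.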

In our experience, the blocking analysis of any multiprocessor protocol
that uses priority-inheritance-related mechanisms is a nightmare. The
problem is that, to handle all the corner cases, pessimistic assumptions
have to be made. This pessimism can really add up and lead to huge
blocking terms.

This email thread has also touched on group-based scheduling. We
proposed a very simple scheme several years ago called "skipping"
in the context of Pfair scheduling that makes this easy:

P. Holman and J. Anderson, "Locking under Pfair Scheduling", ACM
Transactions on Computer Systems, Volume 24, Number 2, pp. 140-174,
May 2006.

(An earlier version appeared in RTSS 2002).

In this approach, critical-section lengths must be known, and any
lock request that occurs when a task doesn't have sufficient
budget is simply denied -- the request is retried later when that task
receives additional budget. This prevents a task in one group from
holding a lock while it is preempted (which could severely hurt
lock-requesting tasks in other groups). This scheme is really easy
to implement in conjunction with the FMLP and it doesn't require
complicated budget tracking. Its effects on blocking terms are
also easy to analyze. Thomas Nolte and colleagues (in Sweden) have
written some papers where they've used skip-based locks in
hierarchically-scheduled systems.
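
As a sketch of how little machinery this takes (again my own code, not
the code from the paper): the only thing the lock path needs to consult
is the task's remaining budget and the declared critical-section length.

/* "Skipping": a lock request made without enough remaining budget to
 * cover the whole critical section is denied, so a group's lock holder
 * can never lose its budget mid-critical-section. Requires cs_len (in
 * the same time unit as the budget) to be known up front. */

#include <stdbool.h>

struct group_task {
    unsigned long budget_remaining;   /* execution budget left this period */
};

/* Returns true if the critical section may start now; otherwise the
 * caller must defer the request until its next budget replenishment. */
static bool skip_try_enter(struct group_task *t, unsigned long cs_len)
{
    if (t->budget_remaining < cs_len)
        return false;                 /* skip: retry after replenishment */

    /* ...acquire the FMLP lock and run the critical section
     * non-preemptively here... */
    return true;
}

The check is a single comparison against a known cs_len, which is why no
complicated budget tracking is needed.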

I'll stop rambling now. I guess my overall point is that these
legacy issues like POSIX seem to be forcing you into complex
solutions. Unless there is a way to shed this baggage, I think
it'll be a matter of where you put the complexity -- you'll never
be able to eliminate it (and again, I think complexity is *bad*).

I hope you find my comments worthwhile (and if not, sorry for
sending such a long message).

-Jim Anderson
