Re: RFC for a new Scheduling policy/class in the Linux-kernel

From: Peter Zijlstra
Date: Tue Jul 14 2009 - 12:31:55 EST



Hi Jim,

On Tue, 2009-07-14 at 11:19 -0400, James H. Anderson wrote:
> The first email in this thread that I was cc'ed on talked about
> implementing global EDF,

Hendrik says that its a modified Modified-Least-Laxity-First, so
something like M^2-LLF, and I cc'ed you (and I meant to add Bjorn too,
but now see I failed to actually do so) in the hope that you would have
some thoughts on his scheduling algorithm, since that is what you do a
lot of ;-)

It could well be he modified M-LLF into G-EDF, but I'm behind in my
reading here, so I'll have to leave that up to others.

> but the discussion since has been entirely about synchronization issues.

Right, this seems to be a very hot topic.

> To the best of my
> knowledge, the only published real-time locking protocol that
> can be used under global EDF (and many other algorithms) without
> restricting critical sections (like not allowing certain resources
> to be accessed in a nested manner) is a protocol we designed called
> the flexible multiprocessor locking protocol (FMLP).

> The FMLP is really embarrassingly simple. It's main mechanisms are:
>
> 1. Require critical sections to execute non-preemptively (I know,
> non-preemption violates the "real-fast" religion -- I've been in
> these discussion before :-)).

Lets simply state that industry would like to be able to deal with very
short deadlines :-)

> 2. Execute critical sections in FIFO order (I know, FIFO violates
> the "real-time" religion).

For those locks that are indeed non-preempt busy wait (raw_spinlock_t)
we generally use a FIFO-fair implementation (x86 at least, but I thought
several other platforms were looking at, or have already, implemented a
similar spinlock).

> 3. Deal with nested lock accesses by requiring a higher-level,
> coarser lock to be acquired. While this many seem "too dumb", traces
> we have taken of real systems (including Linux) indicate that
> two-level nesting is not that common and deep nesting is quite rare.
> Why design complicated mechanisms that are both hard to implement
> and analyze to deal with something that is not so common?
>
> In the FMLP, waiting can be implemented via either spinning or
> suspension (technically, critical sections are categorized as
> "short" or "long" and spinning is used for short CSs and suspension
> for long ones -- this categorization is entirely up to the user).
> Spinning is done non-preemptively (suspending obviously is not).
>
> The beauty of non-preemptive, FIFO is that on an m-processor system,
> blocking behavior is easy to analyze. For example, with spin-based
> waiting, the blocking per critical-section access is (m-1) times the
> cost of one critical section. Someone remarked in an earlier email
> that we're really thinking of m being somewhat small (4, maybe 8).
> As such, this is a blocking term that is really hard to beat. As I
> said, simplicity wins. With suspension-based waiting, things are a
> bit more complicated, but not much.

Our problems are many (and I think you know about most if not all).

- Linux is _huge_:
o and while we seem to do a reasonable job of
getting people to cope with basic concurrency, asking them to
take RT constraints into account when doing so will surely be too
much -- worse, we might not ever convince people its a worthy
goal to begin with.

o auditing each and every lock in the kernel simply isn't possible.

- Linux needs to provide isolation:
o we must assume userspace is hostile and try to minimize the
impact of such tasks on others.

- POSIX:
o we're stuck with the legacy here, but we're looking fwd outside
of POSIX.

(I'm probably forgetting half)

> In our experience, the blocking analysis of any multiprocessor protocol
> that uses priority-inheritance-related mechanisms is a nightmare. The
> problem is that, to handle all the corner cases, pessimistic assumptions
> have to be made. This pessimism can really add up and lead to huge
> blocking terms.

Right, Ted holds similar opinions.

Practically though, active Priority Inheritance has enormous benefits
for us simple people that need to get things done :-)

It has allowed us to convert this huge mass of code into something that
is real-time enough for a lot of practical uses, including industrial
robots and the like without the need to analyze each and every lock out
there.

Now I fully appreciate the theoretical trouble full PI and its ilk
gives, so maybe we can do something with lockdep/stat to track and
qualify the lock nesting depths and critical section lengths and use
those to improve upon the situation.

At worst we could use the data to qualify the usability of certain
system calls/traps wrt RT applications.

Also, can't we employ the PI framework to act as a debug/help-guide in
validating/figuring out the PCP levels?

On that, how does the priority ceiling/protection protocol extend to
deadline schedulers?

> This email thread has also touched on group-based scheduling. We
> proposed a very simple scheme several years ago called "skipping"
> in the context of Pfair scheduling that makes this easy:

> In this approach, critical-section lengths must be known, and any
> lock request that occurs when a task doesn't have sufficient
> budget is simply denied -- the request is done later when that task
> receives additional budget. This avoids a task in one group from
> holding a lock while it is preempted (which could severely hurt
> lock-requesting tasks in other groups). This scheme is really easy
> to implement in conjunction with the FMLP and it doesn't require
> complicated budget tracking. Its effects on blocking terms are
> also easy to analyze. Thomas Nolte and colleagues (in Sweden) have
> written some papers where they've used skip-based locks in
> hierarchically-scheduled systems.

Not granting locks when the contender doesn't have enough budget to
finish them seems like a very attractive solution, however the cost of
having to analyze the critical section seems prohibitive.

That said, we could maybe experiment with this for a few key lock sites.

Anyway, our goal (the linux-rt folks) is to make Linux a better RT OS.
I'm sure we'll never get hard enough for some people, but I'm sure we
can do better than we do today.

Furthermore we're limited by the existing legacy (both spec and code
wise), but I think we have been able to make good progress through tools
that help us, such as lockdep and the latency tracer (now ftrace), which
help us find trouble in an active manner.

And I think discussion such as these help us find new directions to
explore, so carry on!

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/