Re: OS Masters

From: Andrew Morton (andrewm@uow.edu.au)
Date: Wed Apr 19 2000 - 21:19:26 EST

Next message: Matthew Hanselman: "portmap hangs on shutdown with 2.3.99-pre5"
Previous message: Paul Mackerras: "Re: Serious question, only yes or no needed."
In reply to: Mel: "OS Masters"
Next in thread: Sean Hunter: "Re: OS Masters"
Reply: Sean Hunter: "Re: OS Masters"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Mel wrote:
>
> Hello,
>
> My name is Mel Gorman. I'm currently studying final year computer systems
> in University of Limerick, Ireland. I have the opportunity to study a
> masters (2 years) in the college next year on operating systems and the
> direction of the masters is largely up to myself.

Hello, Mel.

I'm going to suggest a quite different tack: correctness, not features.
Less glamorous, more useful and probably a greener research pasture.

Correct me if I'm wrong, but all the projects which have been suggested
thus far have already been done elsewhere.

Linux built its rep on the 2.0.3x series and it's my seat-of-the-pants
impression that its reliability has drifted off a bit since then.
Increased code size, increased complexity, finer-grained locking, loss
of some original developers, lack of organised QA process, etc.

What can you do to find bugs in the kernel? There are two macro
approaches: static analysis and dynamic.

Static analysis would involve some sort of correctness proof of certain
problem areas such as locking. There was some talk about this a few
months ago, led by Ingo Molnar. This is definitely not my area - too
academic :) This one is tough.

Dynamic analysis would involve poking new stuff into the kernel and
then running it, see what happens. There is a _lot_ you can do here,
and I suspect it would be more useful for Linux (if not for your
grades) than static analysis. Ideas:

- Write nasty torture-test programs, run them.

- Change all the resource allocation layers such as kmalloc(),
get_free_page(), etc so they return failures under controlled,
_repeatable_ conditions. Run the kernel. Report lots of bugs.

- Do the same with I/O functions: for example, change a block device
driver so that it reports errors under controlled conditions.

- The reason timing races are so hard to find is that they are often
  very short. Find a clever way to generically increase the duration
  of timing windows so that bugs are exposed. For example, you could
  use the check-memory-usage hooks (described below) to make one CPU
  run really slowly - make it do a udelay(1000) every few instructions,
  see what happens.

- Pretty simple: do an audit.

- Make the kernel's core functions self-checking. Step one is to
  work out what all the subtle, secret preconditions are for all the
  functions (the global lock should be held, interrupts should be
  disabled, the sk_buff_head should be locked, etc). Step two is to
  fill the kernel with assertions which check that these rules are
  really being observed.

- Deadlock detection: if a spinlock can be claimed from both process
  and interrupt context then make sure that the process-context claim
  is always done with spin_lock_irqsave() or with interrupts disabled.
  (Something like that...) Lots of different scenarios here.

- The holy grail. Keep track of ALL the kernel's memory accesses.
  Dynamically associate each memory access with a lock or semaphore.
  Once the association has been established by this observation, turn
  on bug-checking: detect inconsistent usage of the memory wrt locks,
  semaphores and interrupts. This is (slightly) straightforward for
  statically allocated memory (bss), but real tough for dynamically
  allocated lockable things like sk_buff_head's. Given an access to an
  address in memory, you need to divine semantic knowledge about what
  it _is_ to check how other accessors are using it. You'll need to
  bang on the compiler a bit for this one.

  A useful starting-point is gcc's little-known -fcheck-memory-usage
  option. Docs at
  http://gcc.gnu.org/ml/gcc-patches/1999-10n/msg00539.html. You'll
  need to disable the compiler's checking of stack-based variables -
  it's uninteresting for this application and has a non-fatal bug:
  http://gcc.gnu.org/ml/gcc-patches/1999-10n/msg00600.html

If you can pull this one off you can sell it to a commercial OS
vendor for ten mill :)

- Other things I've forgotten about :( Bug me if youre interested.
  In general, troll the mail archives, identify consitent problem types
  and then find a way to change the kernel (or compiler) so that
  remaining problems of this type are accentuated/exposed.

One problem with all these approaches is coverage: there is no way you
will be able to test more than half of the kernel codebase. But
perhaps, if you manage to find a shower of bugs, others may be
persuaded to use your tools on other parts of the kernel.

-- -akpm-

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/

Next message: Matthew Hanselman: "portmap hangs on shutdown with 2.3.99-pre5"
Previous message: Paul Mackerras: "Re: Serious question, only yes or no needed."
In reply to: Mel: "OS Masters"
Next in thread: Sean Hunter: "Re: OS Masters"
Reply: Sean Hunter: "Re: OS Masters"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Sun Apr 23 2000 - 21:00:16 EST