Re: 2.4.7-pre9..

From: Linus Torvalds (
Date: Fri Jul 27 2001 - 12:46:44 EST

On Fri, 27 Jul 2001, Matthew Dharm wrote:
> It looks like I missed an important discussion in the torrent of e-mail
> that I receive... could someone give me the 30-second executive summary so
> I can look at what may need to change in usb-storage?

The basic summary is that we had this (fairly common) way of waiting for
certain events by having a locked semaphore on the stack of the waiter,
and then having the waiter do a "down()" which caused it to block until
the thing it was waiting for did an "up()".

This works fairly well, _but_ it has a really small (and quite unlikely)
race on SMP, that is not so much a race of the idea itself, as of the
implementation of the semaphores. We could have fixed the semaphores, but
there were a few reasons not to:

 - the semaphores are optimized (on purpose) for the non-contention case.
   The "wait for completion" usage has the opposite default case
 - the semaphores are quite involved and architecture-specific, exactly
   due to this optimization. Trying to change them is painful as hell.

So instead, I introduced the notion of "wait for completion":

        struct completion event;

        .. pass of event pointer to waker ..

where the thing we're waiting for just does "complete(event)" and we're
all done.

This has the advantage of being a bit more obvious just from a syntactic
angle about what is going on. It also ends up being slightly more
efficient than semaphores because we can handle the right expected case,
and it also avoids the implementation issue that made for the race in the
first place.

Switching over to the new format is really trivial:

 struct semaphore -> struct completion
 init_MUTEX_LOCKED -> init_completion
 down() -> wait_for_completion()
 up() -> complete()

and you can in fact maintain 2.2.x compatibility by just having a 2.2.x
compatibility file that does the reverse mappings.

In case anybody cares, the race was that Linux semaphores only protect the
accesses _inside_ the semaphore, while the accesses by the semaphores
themselves can "race" in the internal implementation. That helps make an
efficient implementation, but it means that the race was:

        cpu #1 cpu #2

        down(&sem); up(&sem);
                                        wake_up(&sem.wait) /*BOOM*/

where the waker still touches the semaphore data structure after the
sleeper has become happy with it no longer being locked - and free'd the
data structure by virtue of freeing the stack.


To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at
Please read the FAQ at

This archive was generated by hypermail 2b29 : Tue Jul 31 2001 - 21:00:34 EST