Re: [patch] new spinlock variant, spinlock-2.3.30-A4

Ingo Molnar (mingo@chiara.csoma.elte.hu)
Tue, 30 Nov 1999 23:18:25 +0100 (CET)


On 30 Nov 1999, Dimitris Michailidis wrote:

> My understanding is that Processor Ordering does not guarantee this at
> all. It guarantees that if CPU i observes a store that occurred on CPU
> j at time t then i can also observe all stores that occurred on j
> before t. OTOH, a store on j at time t may be observed by i at an
> arbitrary point after t. So in the code above a store by CPU 0, say,
> to (%%esi) is observed immediately by CPU 0 itself, but you don't know
> when CPU 3 will see it. You just know that it will see it before it
> sees any stores on 0 after the one to (%%esi).

This is all we need. Note that the lock acquire code first stores to a
byte, then observes the _whole_ word. [Let's just assume the 4-CPU variant
for simplicity.] If the whole 32-bit word is read and found to be equal to
the 'empty except me' value, then we have observed all other bytes in that
word as well, and the whole value of the 32-bit word is 'committed state'.
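
To make the fast path concrete, here is a minimal sketch in C of the
acquire step described above, assuming the 4-CPU variant on little-endian
x86; the type and names (byte_spinlock_t, byte_lock_try) are mine for
illustration, not the actual spinlock-2.3.30-A4 code:

/*
 * Illustrative sketch only: one byte per CPU within a 32-bit word.
 * To acquire, a CPU stores into its own byte, then reads back the
 * whole word.  If the word equals 'only my byte set', no other claim
 * was visible and the lock is taken; otherwise retract and go slow.
 */
typedef struct {
        volatile unsigned int word;     /* byte N belongs to CPU N */
} byte_spinlock_t;

static inline int byte_lock_try(byte_spinlock_t *lock, int cpu)
{
        volatile unsigned char *mybyte =
                (volatile unsigned char *)&lock->word + cpu;
        unsigned int mine_only = 1U << (cpu * 8);  /* 'empty except me' */

        *mybyte = 1;                    /* 1. store to our own byte */
        if (lock->word == mine_only)    /* 2. observe the _whole_ word */
                return 1;               /* no other claim seen: got it */
        *mybyte = 0;                    /* conflict: retract, go slow */
        return 0;
}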

If any other CPU observes this word, it can see one of two versions of
this cacheline: A) the 'before' state or B) the 'after' state. A) If it
observed the 'before' state but did a store of its own _before_ that
observation, then we have a contradiction, because that store would
already have been visible to us when we read the whole word, and we would
not have seen the 'empty except me' value. This means that A) cannot
happen. B) If another CPU observes this word in the 'after' state, then
there is no contradiction: it will see the byte-write committed.

How is this mechanism implemented? My understanding is that there are
several 'fill and store buffers' in a P6 class CPU, and these fill/store
buffers actively snoop for other CPUs advertising their to-be-committed
stores. (Remember that a CPU is doing many things at once, so this is all
happening in parallel.) If any CPU has advertised its store and nobody has
any problem with it (within a timeout), then the cacheline becomes
committed. If there is any conflict (i.e. another CPU has the very same
cacheline as well, with not-yet-committed stores or relied-upon speculative
reads), then 1) conflicting stores go through a vote and the winner
commits; the losers throw away and retry their stores and refill the
cacheline; 2) stores conflicting with speculative reads win
unconditionally: the speculative reads (and all other partly executed
instructions relying on the results of those speculative reads) get thrown
away unconditionally and the cacheline is re-read. (Maybe there is some
additional arbitration and round-robin priority so that fairness is
guaranteed and every CPU advances sooner or later - I don't know.)

Why is this mechanism cool? In 99.99% of the cases there is no conflict,
and every processor acts and performs as a fully speculative, weakly
ordered uniprocessor CPU. If any conflict arises, there is some kind of
rollback mechanism in place that restores the original register contents.
This is 'opportunistic execution', i.e. the basic assumption is that
everything is conflict-free.

> Plus, under contention the above lock can force all contending CPUs to
> go into the slow path.

A contended lock by definition has many CPUs in the slow path. The real
problem (and you are right that there is a problem) is that with a naive
slow path, contending CPUs might never get out of the slow path, because
the 'read and modify' cycle is not atomic there; but this is a detail
which I don't believe is a fundamental problem and which can be dealt with.
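
For illustration, a naive slow path of the kind I mean could look like the
sketch below (building on the hypothetical byte_lock_try() above, not the
patch's code); the 'wait until the word drains, then re-store my byte'
cycle is visibly not atomic, which is exactly where the fairness problem
comes from:

/*
 * Hypothetical slow path, only to illustrate the issue: two contending
 * CPUs can both see the word drain to zero, both re-store their bytes,
 * both collide, and retry forever.  Some extra arbitration or back-off
 * would be needed to guarantee forward progress.
 */
static inline void byte_lock_slow(byte_spinlock_t *lock, int cpu)
{
        for (;;) {
                while (lock->word != 0)
                        ;               /* wait for all claims to drain */
                if (byte_lock_try(lock, cpu))
                        return;         /* won this round */
                /* lost again: nothing here prevents losing forever */
        }
}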

-- mingo
