Re: rwsem benchmark [was Re: [PATCH] rw_semaphores, optimisations try #3]

From: Andrea Arcangeli (andrea@suse.de)
Date: Tue Apr 24 2001 - 08:59:23 EST


On Tue, Apr 24, 2001 at 02:07:47PM +0100, David Howells wrote:
> It was my implementation that triggered it (I haven't tried it with yours),
> but the bug occurred because the SUBL happened to make the change outside of
> the spinlocked region in the slowpath at the same time as the wakeup routine
> was running on the other CPU.

That problem does not seem to apply to my slow-path algorithm.

> I'll have a look at the sequence and make sure that it does actually apply to
> your implementation. It may not... but it doesn't hurt to check.

Indeed, I'd be glad if you could verify, of course. Thanks!

> > yes of course, you use rwsem_cmpxchgw, which is unnecessary.
>
> Actually, I use this to try and avoid the following loop that you've got in
> your code:
>
> > +static void __rwsem_wake(struct rw_semaphore *sem)
> ...
> > + again:
> > + count = rwsem_xchgadd(-wait->retire, &sem->count);
> > + if (!wake_read && (count & RWSEM_READ_MASK)) {
> > + count = rwsem_xchgadd(wait->retire, &sem->count);
> > + if ((count & RWSEM_READ_MASK) == 1)
> > + goto again;
>
> I now only have that loop in the rwsem_up_write_wake() function.

I don't care much about the slow-path performance, mainly because to make
it faster I would have to slow down up_write with a cmpxchg too.
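
To make the trade-off concrete, I mean something like this (only a
sketch: the bias names and the helper signatures here are invented for
illustration, this is neither your code nor mine):

/* xadd flavour: up_write is always one locked xadd; the price is
 * paid in the slow path instead. Assumes rwsem_xchgadd() returns
 * the pre-add value. */
static inline void __up_write_xadd(struct rw_semaphore *sem)
{
        long count = rwsem_xchgadd(-RWSEM_WRITE_BIAS, &sem->count);
        if (count & RWSEM_WAITING_MASK)
                __rwsem_wake(sem);
}

/* cmpxchg flavour: the fast path can fail, but a failure has
 * already read the current count and can hand it to the wake
 * routine, as you describe below. */
static inline void __up_write_cmpxchg(struct rw_semaphore *sem)
{
        long seen = rwsem_cmpxchgw(&sem->count, RWSEM_WRITE_BIAS, 0);
        if (seen != RWSEM_WRITE_BIAS)
                rwsem_up_write_wake(seen, sem);
}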

> But! In mine, if __up_write()'s CMPXCHG failed, then it has also read the
> counter, which it then passes as an argument to rwsem_up_write_wake(). This
> means I can avoid the aforementioned loop in most cases, I suspect, by seeing
> if the active counter was 1 at the time of the failed CMPXCHG.
>
> This also means that if an ex-writer wakes up a writer-to-be, the only atomic
> instruction performed on sem->count is the CMPXCHG in __up_write().
>
> For the ex-writer waking up readers case, we have to perform an additional
> XADD, but this must be done before anyone is woken up, else __up_read() can
> get called to decrement the count before we've incremented it. I count the
> number of things I want to wake up, adjust the count (one LOCKED ADD only) and
> then wake the batch up.
>
> You dive into a LOCKED XADD/XADD loop for each process you wake, which in the
> best case will give you one LOCKED XADD per process.

I don't dive into the loop for each process I wake; I only execute one
locked xadd for each process I wake (that's the de-retire logic), just as
I also execute one xadd for each process that goes to sleep in schedule()
in the slow path (that's the retire logic). That's really clean behaviour
of my slow path, I think.
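
In other words the heart of my __rwsem_wake is just the fragment you
quoted above, which I'd annotate like this (assuming, as in my patch,
that rwsem_xchgadd() returns the pre-add value; the comments are only
my reading of it):

again:
        /* de-retire: undo with one xadd, on the sleeper's behalf,
         * the xadd it did on sem->count before calling schedule() */
        count = rwsem_xchgadd(-wait->retire, &sem->count);
        if (!wake_read && (count & RWSEM_READ_MASK)) {
                /* we wanted to wake a writer but readers were still
                 * active: back off with a second xadd... */
                count = rwsem_xchgadd(wait->retire, &sem->count);
                /* ...unless the last reader released in the window,
                 * in which case retry the wake-up */
                if ((count & RWSEM_READ_MASK) == 1)
                        goto again;
        }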

> Looking again at rwsem_up_read_wake()... I can't actually eliminate the
> CMPXCHG there because the active count has already been decremented, and so
> will need to be incremented again prior to a wake-up being performed. However,
> if the increment was performed and someone else had incremented the count in
> the meantime, we have to decrement it again... but this can cause a transition
> back to zero, which we have to check for... and if that occurred...

I think what you describe is similar to what I'm doing in my algorithm
(and this way I don't need any cmpxchg in either the slow or the fast
path).

> You get the idea, anyway.
>
> Oh yes... this idea should be faster on SPARC/SPARC64 and IA64 which don't
> have useful XADD instructions (FETCHADD can only use small immediate values),
> only CMPXCHG/CAS are really useful there.
>
> > I guess I'm faster because I avoid the pipeline stall using "+m" (sem->count)
> > that is written as a constant, that was definitely an intentional idea.
>
> "+m" doesn't avoid the stall. That's just shorthand for:
>
> : "=m"(sem->count) : "m"(sem->count)
>
> which is what mine has.

You have it, yes, but you don't use it ;). I do use it, so my compiler
can choose whether to dereference the constant address directly in the
incl/etc. or to use the (%%eax) that you hardcoded.
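
The difference I mean is this (just a sketch with a stand-in opcode,
not lifted from either patch):

/* "+m": the compiler picks the addressing mode itself, so the
 * locked op can be encoded against sem->count directly: */
__asm__ __volatile__(
        "lock; decl %0"
        : "+m"(sem->count));

/* hardcoded (%%eax): sem has to be marshalled into %eax first,
 * whatever the surrounding code was doing with the registers: */
__asm__ __volatile__(
        "lock; decl (%%eax)"
        : "=m"(sem->count)
        : "a"(sem), "m"(sem->count));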

> "a+" luckily causes the avoidance by saying EAX gets clobbered, causing the
> compiler to save via an additional register. Note that the compiler saves
> anyway, even if the register will be discarded after being restored - yuk!
>
> I think this one will depend on the surrounding code anyway. I suspect in some
> chunks of C mine will be faster, and in others yours will be, all depending on
> when EAX is loaded.

Ok, but you save only in the slow path, so the fast path should not be
affected. So using "a" instead of "+a" shouldn't be able to hurt the fast
path; it should only help, I think (that's why I changed it too).
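
I.e. (again only a sketch):

/* "a": %eax is a pure input, so the compiler may keep relying on
 * it still holding sem afterwards, no save needed: */
__asm__ __volatile__(
        "lock; incl (%%eax)"
        : "=m"(sem->count)
        : "a"(sem), "m"(sem->count));

/* "+a": %eax is declared modified, so if sem is still live the
 * compiler has to stash a copy in another register first: */
__asm__ __volatile__(
        "lock; incl (%%eax)"
        : "+a"(sem), "=m"(sem->count)
        : "m"(sem->count));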

> > My point is that when you set XADD you still force duplication of the header
> > stuff into the asm/*
>
> If you want to use my XADD algorithm, then I think that's going to be fair
> enough for now. You have to provide some extra routines anyway.

From the arch part you "only" have to provide some extra routines (not
the headers); that's why I included <asm/rwsem_xchgadd.h> from
<linux/rwsem_xchgadd.h> ;)
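
So the layering ends up like this (the exact contents here are only
indicative, see the patch for the real thing):

/* <linux/rwsem_xchgadd.h>: the whole generic algorithm, shared by
 * every arch that selects the xchgadd strategy */
#include <asm/rwsem_xchgadd.h>

extern void __rwsem_wake(struct rw_semaphore *sem);
/* ...the generic slow path, built only on the arch primitive... */

/* <asm-i386/rwsem_xchgadd.h>: just the primitive itself */
static inline long rwsem_xchgadd(long delta, long *count)
{
        __asm__ __volatile__(
                "lock; xaddl %0, %1"
                : "+r"(delta), "+m"(*count));
        return delta;   /* xadd leaves the old value in %0 */
}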

Andrea