Re: [RFC PATCH v5 5/5] riscv/cmpxchg: Implement xchg for variables of size 1 and 2

From: Palmer Dabbelt
Date: Thu Aug 10 2023 - 12:23:39 EST


On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@xxxxxxxxxx wrote:
On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> xchg for variables of size 1-byte and 2-bytes is not yet available for
> riscv, even though it's present in other architectures such as arm64 and
> x86. This could lead to not being able to implement some locking mechanisms
> or requiring some rework to make it work properly.
>
> Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> architectures.
>
> Signed-off-by: Leonardo Bras <leobras@xxxxxxxxxx>


Hello Arnd Bergmann, thanks for reviewing!

Parity with other architectures by itself is not a reason to do this,
in particular the other architectures you listed have the instructions
in hardware while riscv does not.

Sure, I understand RISC-V doesn't have native support for xchg on variables of
size < 4B. My argument is that it's nice to have even an emulated version of
this in case any future mechanism wants to use it.

Not having it may mean we won't be able to enable a given mechanism on RISC-V.

IIUC the ask is to have a user within the kernel for these functions. That's the general thing to do, and last time this came up there was no in-kernel use of it -- the qspinlock stuff would, but we haven't enabled it yet because we're worried about the performance/fairness stuff that other ports have seen and nobody's got concrete benchmarks yet (though there's another patch set out that I haven't had time to look through, so that may have changed).

So if something uses these I'm happy to go look closer.

Emulating the small xchg() through cmpxchg() is particularly tricky
since it's easy to run into a case where this does not guarantee
forward progress.


Didn't get this part:
By "emulating small xchg() through cmpxchg()", did you mean like emulating an
xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ?

If so, yeah, it's a fair point: in some extreme case we could have multiple
threads accessing a given cacheline, with sc always failing. On the other hand,
there are 2 arguments on that:

1 - Other architectures (such as powerpc, arm and arm64 without LSE atomics)
also seem to rely on this mechanism for every xchg size. Other archs like csky
and loongarch use asm that looks like mine to handle size < 4B xchg.

This is also something that almost no architecture
specific code relies on (generic qspinlock being a notable exception).


2 - As you mentioned, there should be very little code that will actually make
use of xchg for vars < 4B, so it should be safe to assume it's fine to not
guarantee forward progress for those rare usages (as some of the
above-mentioned archs do).
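
For readers following the forward-progress point, here is a minimal userspace
sketch of emulating a 1-byte xchg with a compare-and-swap loop on the 32-bit
word that contains it. It is not the code from this patch (which, per the
discussion above, uses lr/sc asm directly); the helper name xchg8_emulated and
the use of C11 <stdatomic.h> are assumptions made purely for illustration. The
byte is selected by its bit position within the word's value, so the sketch
does not depend on endianness.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): exchange one byte inside an
 * aligned 32-bit word by retrying a 32-bit compare-and-swap.
 * byte_idx selects bits [8*byte_idx + 7 : 8*byte_idx] of the word's value.
 */
static uint8_t xchg8_emulated(_Atomic uint32_t *word, unsigned byte_idx,
                              uint8_t newval)
{
    unsigned shift = byte_idx * 8;
    uint32_t mask = (uint32_t)0xff << shift;
    uint32_t old = atomic_load_explicit(word, memory_order_relaxed);
    uint32_t desired;

    do {
        /* Keep the three neighbouring bytes, replace only the target byte. */
        desired = (old & ~mask) | ((uint32_t)newval << shift);
        /*
         * On failure 'old' is reloaded with the current word. The loop
         * retries even if only a neighbouring byte changed, which is why
         * forward progress is not guaranteed under heavy contention.
         */
    } while (!atomic_compare_exchange_weak_explicit(word, &old, desired,
                                                    memory_order_seq_cst,
                                                    memory_order_relaxed));

    /* Return the previous value of the target byte. */
    return (uint8_t)((old & mask) >> shift);
}
```

Note that a writer hammering any other byte of the same word can keep the loop
above retrying, even though it never touches the byte being exchanged.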

I would recommend just dropping this patch from the series, at least
until there is a need for it.

While I agree this is a valid point, I believe it's more interesting to have it
implemented in case any future mechanism wants to make use of this.

Thanks!
Leo