Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

From: Linus Torvalds
Date: Wed Oct 18 2023 - 16:22:45 EST


On Wed, 18 Oct 2023 at 12:33, Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
>
> This pach works for me:

Looks fine.

But you actually bring up another issue:

> BTW: I also don't understand the comment from include/linux/smp.h:
>
> /*
> * Allow the architecture to differentiate between a stable and unstable read.
> * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
> * regular asm read for the stable.

I think the comment is badly worded, but I think the issue may actually be real.

One word: rematerialization.

The thing is, turning inline asm accesses to regular compiler loads
has a *very* bad semantic problem: the compiler may now feel like it
can not only combine the loads (ok), but also possibly rematerialize
values by re-doing the loads (NOT OK!).

IOW, the kernel often has very strict requirements of "at most once"
behavior, because doing two loads might give different results.

The cpu number is a good example of this.

And yes, sometimes we use actual volatile accesses for them
(READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general,
and are much too strict. Not only does gcc generally lose its mind
when it sees volatile (ie it stops doing various sane combinations
that would actually be perfectly valid), but it obviously also stops
doing CSE on the loads (as it has to).

So the "non-volatile asm" has been a great way to get the "at most
one" behavior: it's safe wrt interrupts changing the value, because
you will see *one* value, not two. As far as we know, gcc never
rematerializes the output of an inline asm. So when you use an inline
asm, you may have the result CSE'd, but you'll never see it generate
more than *one* copy of the inline asm.

(Of course, as with so much about inline asm, that "knowledge" is not
necessarily explicitly spelled out anywhere, and it's just "that's how
it has always worked").

IOW, look at code like the one in swiotlb_pool_find_slots(), which does this:

int start = raw_smp_processor_id() & (pool->nareas - 1);

and the use of 'start' really is meant to be just a good heuristic, in
that different concurrent CPU's will start looking in different pools.
So that code is basically "cpu-local by default", but it's purely
about locality, it's not some kind of correctness issue, and it's not
necessarily run when the code is *tied* to a particular CPU.

But what *is* important is that 'start' have *one* value, and one
value only. So look at that loop, which hasically does

do {
.. use the 'i' based on 'start' ..
if (++i >= pool->nareas)
i = 0;
} while (i != start);

and it is very important indeed that the compiler does *not* think
"Oh, I can rematerialize the 'start' value".

See what I'm saying? Using 'volatile' for loading the current CPU
value would be bad for performance for no good reason. But loading it
multiple times would be a *bug*.

Using inline asm is basically perfect here: the compiler can *combine*
two inline asms into one, but once we have a value for 'start', it
won't change, because the compiler is not going to decide "I can drop
this value, and just re-do the inline asm to rematerialize it".

This all makes me worried about the __seg_fs thing.

For 'current', this is all perfect. Rematerializing current is
actually better than spilling and reloading the value.

But for something like raw_smp_processor_id(), rematerializing would
be a correctness problem, and a really horrible one (because in
practice, the code would work 99.9999% of the time, and then once in a
blue moon, it would rematerialize a different value).

See the problem?

I guess we could use the stdatomics to try to explain these issues to
the compiler, but I don't even know what the C interfaces look like or
whether they are stable and usable across the range of compilers we
use.

Linus