Re: [GIT PULL] Ceph updates for 5.20-rc1

From: Linus Torvalds
Date: Fri Aug 12 2022 - 00:05:58 EST


On Thu, Aug 11, 2022 at 3:43 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Oh, sadly, clang does much worse here.
>
> Gcc ends up being able to not have a stack frame at all for
> __d_lookup_rcu() once that DCACHE_OP_COMPARE case has been moved out.
> The gcc code really looks very nice.
>
> Clang, not so much, and it still has spills and reloads.

I ended up looking at the clang code generation more than I probably
should have, because I found it so odd.

Our code is literally written to not need that many values, and it
should be easy to keep everything in registers.

It turns out that clang is trying much too hard to be clever in
dentry_string_cmp(). The code is literally written so that we keep the
count of remaining characters in 'tcount', and then at the end we can
generate a 'mask' from that to ignore the parts of the pathname that
are beyond the size.

So it creates the mask by just doing

mask = bytemask_from_count(tcount);

and the bytemask_from_count() is a very straightforward "take the byte
mask, multiply by 8 to get the bitmask, and then you can use that to
shift bits around and you get the byte mask:

#define bytemask_from_count(cnt) (~(~0ul << (cnt)*8))

But clang seems to decide that this "multiply by 8" all is a very
costly operation, and seems to have figured out that since we always
do everyting one word at a time, so 'tcount' is always updated in
sizeof(long), and that means that at the end 'tcount' is guaranteed to
be the original 'tcount & 7' (on a 64-bit machine).

End result: the above mask can be pre-computed before entering the loop at all.

Which is absolutely true, but adds to register pressure, and means
that you pre-compute that mask whether you actually need to or not.

But hey, even that is fine, because you can actually move the mask
outside *both* of the loops (both the hash chain *and* the inner
"check each pathname" loop, because the initial value of 'tcount' is
very much a fixed value per function call.

The computation is pretty cheap, the bigger expense is that now you
have one more thing you need to keep track of.

But on its own, that would probably still be ok, because hey, we have
extra registers. But it does make the liveness of that extra value be
quite large.

But the extra registers are then used by the fact that we have that

b = load_unaligned_zeropad(ct);

which *ALSO* has that incredibly expensive "multiply by 8" operation,
except now it's a different value that needs to be multiplied by 8,
namely the offset within an aligned long-word, which we use to fix up
any unaligned faults that take exceptions.

And since clang seems to - once again - think that multiplying by 8 is
INCREDIBLY EXPENSIVE, what it does is say "hey, I see all the places
where that base pointer is updated, and I can have my own internal
variable that tracks that value, except I do "update-by-8" there.

So now clang has created _another_ special internal variable that it
updates inside the inner loop, which tracks the bit offset of the
*aligned* part of the 'ct' variable, so that in the *extremely*
unlikely (read: never actually happens) situation where that
load_unaligned_zeropad takes an exception, it has the shift value
pre-computed.

And now clang runs out of registers, and honestly, it took me a
*loong* time to understand what the generated code was even trying to
do, because this was all so incredibly pointless and the code looked
so very very odd.

Anyway, it's amusing mostly because both of those "optimizations" that
clang did actually required some very clever compiler work.

It's just that they were both doing extra work in order to avoid an
operation that was _less_ work in the first place.

I suspect that there's some core clang optimization that goes
"addition is much cheaper than multiplication, and if we have an index
variable that is multiplied by a value, we can keep a shadow variable
of that index that is 'pre-multiplied', and thus avoid the
multiplication entirely".

It's absolute genius.

It just happens to generate really quite bad code, because multiplying
by 8 outside the loop is _so_ much cheaper than what clang ends up
generating.

On the whole, I think that any value that needs one or two ALU
operations to be calculated should *not* be seen as some prime example
of something that needs to be pre-computed in order to move it outside
the loop. The extra register pressure will make it a loss.

But admittedly we also have no way to tell clang that that exception
basically never happens, so maybe it thinks it is really important.

I'm adding clang people to the list, in case anybody wants to look at
it. The patch that started my path down this insanity is at

https://lore.kernel.org/all/CAHk-=wjCa=Xf=pA2Z844WnwEeYgy9OPoB2kWphvg7PVn3ohScw@xxxxxxxxxxxxxx/

in case somebody gets an itch to look at a case where clang generates
some very odd code.

That said, this code has been streamlined so much that even clang
can't *really* make the end result slow. Just not as good as what gcc
does, because clang tries a bit too hard and bites off a bit more than
it can chew.

Linus