Re: [GIT PULL] x86 cleanups for v5.7

From: Linus Torvalds
Date: Thu Apr 02 2020 - 13:00:50 EST


On Thu, Apr 2, 2020 at 6:40 AM Borislav Petkov <bp@xxxxxxxxx> wrote:
>
> Btw, looking at this:
>
> https://reviews.llvm.org/rG50cac248773
>
> and talking to a gcc guy (CCed), it should be also relatively easy to do
> the fallthrough variant in gcc too so you could open a feature request
> for that in the gcc bugzilla.

I absolutely want to have that in gcc too (I've mentioned it a couple
of times to people like rth), but I wanted to have more than just an
abstract test-case for it.

Sure, I could attach my current patch to the bug report, but honestly,
if I was a compiler guy, I'd care a lot less about a "hey, you have to
apply this patch that won't even _compile_ for you right now to a tree
you don't really care about that much" kind of bug-report than a "hey.
look, upstream Linux already does this, and with clang I get this
code, and with gcc it's much worse".

And to be honest, while I love the asm goto output, it's not *that*
noticeable. Particularly since we don't have a lot. This is the inner
loop of "strncpy_from_user" with the feature (and clang, of course):

90: mov (%r15,%r13,1),%rdx
94: mov %rdx,(%r12,%r13,1)
98: mov %rdx,%rsi
9b: not %rsi
9e: and %rax,%rsi
a1: add %rcx,%rdx
a4: and %rsi,%rdx
a7: jne ec <strncpy_from_user+0xec>
a9: add $0x8,%r13
ad: add $0xfffffffffffffff8,%rbx
b1: cmp $0x7,%rbx
b5: ja 90 <strncpy_from_user+0x90>

And all the jumps are just for testing if there was a zero in there
(the jne in the middle) or testing for the length of the remaining
space (that "cmp $7;ja" at the end).

This is the same thing with gcc (and without asm goto, of course):

91: lea (%rcx,%rdi,1),%rdx
95: mov %rcx,0x0(%r13,%rax,1)
9a: not %rcx
9d: and %rcx,%rdx
a0: mov %rdx,%rcx
a3: and %rsi,%rcx
a6: jne 108 <strncpy_from_user+0x108>
a8: sub $0x8,%rbx
ac: add $0x8,%rax
b0: cmp $0x7,%rbx
b4: jbe 12d <strncpy_from_user+0x12d>
b6: mov %r8d,%edx
b9: mov 0x0(%rbp,%rax,1),%rcx
be: test %edx,%edx
c0: je 91 <strncpy_from_user+0x91>

and the only real difference in that inner loop from the asm goto is
that the gcc code has three extra instructions (don't ask me why gcc
decided to cache the value 0 in %r8, that just looks stupid):

b6: mov %r8d,%edx
...
be: test %edx,%edx
c0: je 91 <strncpy_from_user+0x91>

(Ok, the code sequence looks completely different because the two
compilers also end up generating the function differently: gcc does
the user space load at the end of the loop while clang does it at the
top. That's probably related to the fact that gcc has to generate that
extra jump anyway, and decided to make that the loop finishing jump).

So realistically, it doesn't make a huge difference. It's a bit more
noticeable when you have the "multiple unsafe_get_user()s in a row"
pattern, but we don't really have that (we have lots of "multiple
unsafe_put_user() in a row").

Of course, one reason we don't have that pattern is that it generates
nasty code with gcc (exactly because of the extra test for each
access).

But while I love looking at small things like this, and I'd like to
have all compilers support it, I have to admit that it's not likely to
really _matter_ much.

Linus