Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")

From: Linus Torvalds
Date: Tue Jun 05 2018 - 19:27:41 EST


On Tue, Jun 5, 2018 at 4:20 PM Alexey Dobriyan <adobriyan@xxxxxxxxx> wrote:
>
> This is Broadwell Xeon E5-2620 v4.
> Which is somewhat strange indeed because it should be modern enough.

Yeah, odd.

Here's the benchmark I used:

#define SIZE 4068

int main(int argc, char **argv)
{
	int i;
	unsigned char buffer[SIZE], *p;

	for (i = 0; i < 1000000; i++)
		asm volatile(
			"1: movq %[zero],(%[mem]); addq %[eight],%[mem]; decl %[count]; jne 1b"
			: [mem] "=r" (p)
			: [zero] "i" (0l), [eight] "i" (8l),
			  "0" (buffer), [count] "r" (SIZE/8));
}

where you can change that "i" for [zero] and [eight] to be "r" to get
the register version.

I just timed it, because I'm lazy and perf seemed to be overkill.

It might be some very specific loop buffer issue or something.

Or maybe my benchmark above is broken. I didn't really verify that the
end result was any good (I just did an objdump to verify the asm code
superficially).

Linus