Re: [PATCH] lib/int_sqrt.c: Optimize square root function

From: Peter Zijlstra
Date: Thu Jul 20 2017 - 11:28:03 EST


On Thu, Jul 20, 2017 at 01:24:49PM +0200, Peter Zijlstra wrote:
> ~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt
>
> Performance counter stats for './sqrt' (10 runs):
>
> 328,415,775 cycles:u ( +- 0.15% )
> 1,138,579,704 instructions:u # 3.47 insn per cycle ( +- 0.00% )
>
> 0.088703205 seconds time elapsed

> static __always_inline unsigned long fls(unsigned long word)
> {
> asm("rep; bsr %1,%0"
> : "=r" (word)
> : "rm" (word));
> return BITS_PER_LONG - 1 - word;
> }

That is actually "lzcnt", if I used the regular fls implementation:


static __always_inline unsigned long __fls(unsigned long word)
{
asm("bsr %1,%0"
: "=r" (word)
: "rm" (word));
return word;
}

It ends up slightly more expensive:

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

Performance counter stats for './sqrt' (10 runs):

384,842,215 cycles:u ( +- 0.08% )
1,118,579,712 instructions:u # 2.91 insn per cycle ( +- 0.00% )

0.103018001 seconds time elapsed


Still loads cheaper than pretty much any other combination.