RE: x86/csum: Remove unnecessary odd handling

From: David Laight
Date: Fri Jan 05 2024 - 05:41:51 EST


From: Linus Torvalds
> Sent: 05 January 2024 00:33
>
> On Thu, 4 Jan 2024 at 15:36, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Anyway, since I looked at the thing originally, and feel like I know
> > the x86 side and understand the strange IP csum too, I just applied it
> > directly.
>
> I ended up just applying my 40-byte cleanup thing too that I've been
> keeping in my own tree since posting it (as the "Silly csum
> improvement. Maybe" patch).

Interesting, I'm pretty sure trying to get two blocks of
'adc' scheduled in parallel like that doesn't work.

I got an adc every clock from this 'beast':
+ /*
+ * Align the byte count to a multiple of 16 then
+ * add 64 bit words to alternating registers.
+ * Finally reduce to 64 bits.
+ */
+ asm( " bt $4, %[len]\n"
+ " jnc 10f\n"
+ " add (%[buff], %[len]), %[sum_0]\n"
+ " adc 8(%[buff], %[len]), %[sum_1]\n"
+ " lea 16(%[len]), %[len]\n"
+ "10: jecxz 20f\n"
+ " adc (%[buff], %[len]), %[sum_0]\n"
+ " adc 8(%[buff], %[len]), %[sum_1]\n"
+ " lea 32(%[len]), %[len_tmp]\n"
+ " adc 16(%[buff], %[len]), %[sum_0]\n"
+ " adc 24(%[buff], %[len]), %[sum_1]\n"
+ " mov %[len_tmp], %[len]\n"
+ " jmp 10b\n"
+ "20: adc %[sum_0], %[sum]\n"
+ " adc %[sum_1], %[sum]\n"
+ " adc $0, %[sum]\n"
+ : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
+ [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
+ : [buff] "r" (buff)
+ : "memory" );

Followed by code to sort out any trailing bytes (up to 15).
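
For anyone who would rather read C than the asm, here is a rough portable model of the arithmetic in that loop. It is only a sketch: the names csum_two_chains and add64_carry are made up for illustration, it assumes the length is already a multiple of 16, and it folds each carry back into its own accumulator instead of passing one carry flag between the two registers. The 64-bit result can therefore differ bit-for-bit from the asm, but it is the same value modulo 2^64 - 1, which is all a ones' complement sum needs. It says nothing about timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Ones' complement add: any carry out of bit 63 is added back in. */
static inline void add64_carry(uint64_t *acc, uint64_t b)
{
    uint64_t s = *acc + b;

    *acc = s + (s < b);    /* (s < b) is 1 exactly when the add wrapped */
}

/*
 * Hypothetical model of the loop: add 64-bit words to alternating
 * accumulators, then reduce both into the 64-bit running sum.
 * Assumes len is a multiple of 16; the tail is handled elsewhere.
 */
static uint64_t csum_two_chains(const void *buff, size_t len, uint64_t sum)
{
    const unsigned char *p = buff;
    uint64_t sum_0 = 0, sum_1 = 0;
    uint64_t w[2];

    assert(len % 16 == 0);

    for (; len; len -= 16, p += 16) {
        memcpy(w, p, sizeof(w));    /* safe for misaligned buffers */
        add64_carry(&sum_0, w[0]);
        add64_carry(&sum_1, w[1]);
    }

    /* Finally reduce to 64 bits. */
    add64_carry(&sum, sum_0);
    add64_carry(&sum, sum_1);
    return sum;
}
```

The point of the two accumulators is that consecutive adds do not depend on each other's register result, which is what lets the real loop hide the 'adc' latency discussed below.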

Intel cpus (from P-II until Broadwell 5th-gen) take two clocks for 'adc'
(probably because it needs 3 inputs).
So 'adc' chains ran a lot slower than you might think.
(Clearly no one ever actually benchmarked the old code!)
The first fix made the carry output available early - so adding
to alternate registers helps. IIRC this is in Ivy/Sandy bridge.
Maybe no one cares about Ivy/Sandy bridge and Haswell any more.
AMD cpus don't have this problem.

I'm pretty sure I measured that loop with a misaligned buffer.
Measurably slower, but less than one clock per cache line.
I guess that the cache-line crossing reads get split, but you
gain most back because the cpu can do two reads/clock.

Maybe I'll sort out another patch...

I did get 15/16 bytes/clock with a similar loop that used adox/adcx
but that needed unrolling again and only works on a few cpus.
IIRC AMD have some cpus that support adox - but execute it slowly!
Annoyingly you can't use 'loop' even on cpus that support adox
because it is stupidly slow on Intel cpus (ok on AMD).

That version is a lot of pain since it needs run-time patching.

David
