RE: x86/csum: Remove unnecessary odd handling

From: H. Peter Anvin
Date: Sat Jan 06 2024 - 20:19:20 EST


On January 6, 2024 2:08:48 PM PST, David Laight <David.Laight@xxxxxxxxxx> wrote:
>From: Linus Torvalds
>> Sent: 05 January 2024 18:06
>>
>> On Fri, 5 Jan 2024 at 02:41, David Laight <David.Laight@xxxxxxxxxx> wrote:
>> >
>> > Interesting, I'm pretty sure trying to get two blocks of
>> > 'adc' scheduled in parallel like that doesn't work.
>>
>> You should check out the benchmark at
>>
>> https://github.com/fenrus75/csum_partial
>>
>> and see if you can improve on it. I'm including the patch (on top of
>> that code by Arjan) to implement the actual current kernel version as
>> "New version".
>
>Annoyingly (for me) you are partially right...
>
>I found where my ip checksum perf code was hiding and revisited it.
>Although I found comments elsewhere that the 'jecxz, adc, adc, lea, jmp'
>loop did an adc every clock, it isn't happening for me now.
>
>I'm only measuring the inner loop for multiples of 64 bytes.
>The code for buffers of less than 8 bytes and for partial final words
>is a separate problem.
>The less unrolled the main loop, the less overhead there'll
>be for 'normal' sizes.
>So I've changed your '80 byte' block to 64 bytes for consistency.
>
>I'm ignoring pre-sandy bridge cpu (no split flags) and pre-broadwell
>(adc takes two clocks - although adc to alternate regs is one clock
>on sandy bridge).
>My test system is an i7-7700; I think anything from broadwell (gen 5)
>will be at least as good.
>I don't have a modern amd cpu.
>
>The best loop for 256+ bytes is an adcx/adox one.
>However that requires the run-time patching.
>Next best is the new kernel version (two blocks of 4 adc).
>The surprising one is:
> xor sum, sum
> 1: adc (buff), sum
> adc 8(buff), sum
> lea 16(buff), buff
> dec count
> jnz 1b
> adc $0, sum
>For 256 bytes it is only a couple of clocks slower.
>Maybe 10% slower for 512+ bytes.
>But it needs almost no extra code for 'normal' buffer sizes.
>By comparison the adcx/adox one is 20% faster.
>
>The code is doing:
> old = rdpmc
> mfence
> csum = do_csum(buf, len);
> mfence
> clocks = rdpmc - old
>(That is directly reading the pmc register.)
>With a 'no-op' function it takes 160 clocks (I-cache resident).
>Without the mfence it takes 40 - but then pretty much everything can
>execute after the 2nd rdpmc.
>
>I've attached my (horrid) test program.
>
> David
>
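
For reference, the simple two-adc loop quoted above can be modelled in
portable C roughly as follows. This is only a sketch of the carry
handling, not the kernel code: the names adc64() and csum_adc_pairs()
are made up for illustration, the buffer is assumed to be a whole
number of 16-byte blocks, and a compiler is not guaranteed to turn
this into an actual adc chain.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of one 'adc': returns a + b + *carry and updates *carry. */
static uint64_t adc64(uint64_t a, uint64_t b, unsigned *carry)
{
    uint64_t s = a + b;
    unsigned c = s < a;        /* carry out of a + b */

    s += *carry;
    c |= s < *carry;           /* carry out of adding the carry-in */
    *carry = c;
    return s;
}

/*
 * Hypothetical helper: sum 'blocks' 16-byte blocks starting at buf.
 * Unlike the asm loop (dec/jnz), this version tolerates blocks == 0.
 */
uint64_t csum_adc_pairs(const uint64_t *buf, size_t blocks)
{
    uint64_t sum = 0;
    unsigned carry = 0;        /* 'xor sum, sum' also clears CF */

    while (blocks--) {
        sum = adc64(sum, buf[0], &carry);   /* adc (buff), sum  */
        sum = adc64(sum, buf[1], &carry);   /* adc 8(buff), sum */
        buf += 2;                           /* lea 16(buff), buff */
    }
    return adc64(sum, 0, &carry);           /* adc $0, sum */
}
```

The loop counter is decremented without touching the modelled carry,
which mirrors why the asm version uses dec and lea: neither of them
modifies CF, so the carry chain survives from one iteration to the next.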

Rather than runtime patching, perhaps separate paths...
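
A minimal sketch of what "separate paths" could look like, assuming the
choice is made once and then called through a function pointer. All
names here (csum_fn, csum_path_adc, csum_path_dual, csum_select) are
hypothetical, the have_dual_carry flag stands in for whatever CPU
feature test would really be used, and both paths are plain C
stand-ins rather than the adc and adcx/adox loops themselves.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*csum_fn)(const uint64_t *buf, size_t words);

/* 64-bit add with end-around carry (ones' complement addition). */
static uint64_t add_fold(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;

    return s + (s < b);        /* fold the carry back in */
}

/* Stand-in for the single carry chain (plain adc) path. */
static uint64_t csum_path_adc(const uint64_t *buf, size_t words)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < words; i++)
        sum = add_fold(sum, buf[i]);
    return sum;
}

/* Stand-in for the two independent carry chains path. */
static uint64_t csum_path_dual(const uint64_t *buf, size_t words)
{
    uint64_t even = 0, odd = 0;
    size_t i = 0;

    for (; i + 1 < words; i += 2) {
        even = add_fold(even, buf[i]);
        odd = add_fold(odd, buf[i + 1]);
    }
    if (i < words)             /* odd word count: one word left over */
        even = add_fold(even, buf[i]);
    return add_fold(even, odd);
}

/* Pick a path once; callers keep the returned pointer. */
csum_fn csum_select(bool have_dual_carry)
{
    return have_dual_carry ? csum_path_dual : csum_path_adc;
}
```

Both paths have to produce the same result for the same buffer, so the
selection only affects speed. The cost compared with patching is an
indirect call (or a test and branch) on every invocation.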