RE: x86/csum: Remove unnecessary odd handling

From: David Laight
Date: Fri Jan 05 2024 - 18:53:18 EST


From: Linus Torvalds
> Sent: 05 January 2024 18:06
>
> On Fri, 5 Jan 2024 at 02:41, David Laight <David.Laight@xxxxxxxxxx> wrote:
> >
> > Interesting, I'm pretty sure trying to get two blocks of
> > 'adc' scheduled in parallel like that doesn't work.
>
> You should check out the benchmark at
>
> https://github.com/fenrus75/csum_partial
>
> and see if you can improve on it. I'm including the patch (on top of
> that code by Arjan) to implement the actual current kernel version as
> "New version".

I'd have to fix his benchmark code first :-)
You can't use the TSC unless you lock the cpu frequency.
The longer the test runs for the faster the cpu will run.
I've had to use the performance counters to get accurate cycle counts.

I'm also sure I tried two separate 'adc' chains and failed
to get any overlapped instrictions.
But I'll try his loop in my framework.

The 'lfence' version also really does matter.
In some sense it doesn't matter if you add 10 clocks to the
IP header checksum - they'll be dwarfed by everything else.
But if you do have to checksum an 8k NFS UDP datagram the
loop count matters.

The OOO engine is very good a just piling up instructions
that are waiting for previous instructions in a big queue
and executing instruction for which the data is available.
So the control flow for the checksum code can finish well
before the checksum is available.

On a related point, do you remember what the 'killer app'
was for doing the checksum in copy_to/from_user?

The change pre-dates git and I can't find the commit message.
The only place it is done in the receive path is for UDP.
At a guess that helped 8k UDP datagrams going into a userspace nfsd?
I bet nothing really depends on RX UDP performance to userspace
on systems that don't do hw checksum?

The tx side is done copying data into an skb, I'm not sure if
that is all copies - you really don't want to do it if the
hardware supports tx checksum offload.
Using 'rep movsb' for copy_from_user() has to be faster than
the I-cache busting 'copy and checksum' code - even if you
end up doing a checksum not much later on.

IIRC (I looked a couple of weeks ago) only x86, sparc32 and m68k
actually have 'copy and checksum' (and the m68k didn't seem to
have the markers for faults!).
(All of them are I-cache killers)
The default iovec[] version checksums every fragment - which
gives grief trying to report a 32bit checksum and EFAULT.
OTOH it could sum the linear kernel buffer after the copy.
That would let it fold the checksum to 17bits and report -1
for EFAULT.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)