RE: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c

From: David Laight
Date: Thu Dec 02 2021 - 16:11:48 EST


From: Noah Goldstein
> Sent: 02 December 2021 20:19
>
> On Thu, Dec 2, 2021 at 9:01 AM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
> >
> > On Thu, Dec 2, 2021 at 6:24 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
> > >
> > > I've dug out my test program and measured the performance of
> > > > various copies of the inner loop - usually 64 bytes/iteration.
> > > Code is below.
> > >
> > > It uses the hardware performance counter to get the number of
> > > clocks the inner loop takes.
> > > > This is reasonably stable once the branch predictor has settled down.
> > > > So the difference in clocks between a 64 byte buffer and a 128 byte
> > > buffer is the number of clocks for 64 bytes.
>
> Intuitively 10 passes is a bit low.

I'm doing 10 separate measurements.
The first one is much slower because the cache is cold.
All the ones after (typically) number 5 or 6 tend to give the same answer.
10 is plenty to give you that 'warm fuzzy feeling' that you've got
a consistent answer.
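
A minimal sketch of that bookkeeping, not the actual test program.
The function names are made up for illustration, and taking the lowest
sample after the warm-up passes is an assumption about how to pick the
'settled' value.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Lowest sample once the first 'warmup' passes (cold cache, untrained
 * branch predictor) are discarded. Returns UINT64_MAX if nothing is left.
 */
uint64_t settled_clocks(const uint64_t *samples, size_t n, size_t warmup)
{
    uint64_t best = UINT64_MAX;

    for (size_t i = warmup; i < n; i++)
        if (samples[i] < best)
            best = samples[i];
    return best;
}

/*
 * Clocks for one extra 64-byte block: the settled time for a 128 byte
 * buffer minus the settled time for a 64 byte buffer. The fixed
 * overhead of the measurement cancels out in the subtraction.
 */
uint64_t clocks_per_64_bytes(const uint64_t *t64, const uint64_t *t128,
                             size_t n, size_t warmup)
{
    return settled_clocks(t128, n, warmup) - settled_clocks(t64, n, warmup);
}
```

With 10 passes and a warm-up of 5 or 6 that leaves four or five samples
to compare per buffer size.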

Run the program 5 or 6 times with the same parameters and you sometimes
get a different stable value - probably something to do with stack and
data physical pages.
This was more obvious when I was timing a system call.

> Also you might consider aligning
> the `csum64` function and possibly the loops.

Won't matter here, instruction decode isn't the problem.
Also the uops all come out of the loop uop cache.

> Is there a reason you put `jrcxz` at the beginning of the loops instead
> of the end?

jrcxz is 'jump if cx zero' - hard to use at the bottom of a loop!

The 'paired' loop end instruction is 'loop' - decrement %cx and jump if non-zero.
But that is 7+ cycles on current Intel cpus (ok on AMD ones).

I can get a two clock loop with jrcxz and jmp - as in the examples.
But it is more stable taken out to 4 clocks.

You can't do a one clock loop :-(
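
For illustration only, here is the shape of a jrcxz/jmp loop. This is
not one of the loops from the test program: it adds 8 bytes per
iteration rather than 64, and csum_words is a made-up name. The point
is that jrcxz, lea and jmp all leave the carry flag alone, so the adc
chain survives the loop control. Non-x86-64 builds get a plain C
fallback that computes the same sum.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum 'nwords' 64-bit words with end-around carry. */
uint64_t csum_words(const uint64_t *buf, size_t nwords)
{
    uint64_t sum = 0;

#if defined(__x86_64__) && defined(__GNUC__)
    const uint64_t *end = buf + nwords;
    long idx = -(long)(nwords * 8);  /* negative byte offset, counts up to 0 */

    __asm__ volatile(
        "clc\n"                              /* start with carry clear */
        "1:\n\t"
        "jrcxz 2f\n\t"                       /* exit when %rcx is zero, flags untouched */
        "adcq (%[end],%%rcx), %[sum]\n\t"    /* add next word plus carry */
        "leaq 8(%%rcx), %%rcx\n\t"           /* advance without touching CF */
        "jmp 1b\n"
        "2:\n\t"
        "adcq $0, %[sum]\n"                  /* fold the final carry back in */
        : [sum] "+r"(sum), [idx] "+c"(idx)
        : [end] "r"(end)
        : "memory", "cc");
#else
    /* Portable fallback: fold the carry back in after every add. */
    for (size_t i = 0; i < nwords; i++) {
        sum += buf[i];
        if (sum < buf[i])
            sum++;
    }
#endif
    return sum;
}
```

The test at the top costs nothing extra for an empty buffer, which is
another reason to have jrcxz there rather than a separate length check.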

> > > (Unlike the TSC the pmc count doesn't depend on the cpu frequency.)
> > >
> > > What is interesting is that even some of the trivial loops appear
> > > to be doing 16 bytes per clock for short buffers - which is impossible.
> > > Checksum 1k bytes and you get an entirely different answer.
> > > The only loop that really exceeds 8 bytes/clock for long buffers
> > > > is the adcx/adox one.
> > >
> > > What is almost certainly happening is that all the memory reads and
> > > > the dependent add/adc instructions are all queued up in the 'out of
> > > order' execution unit.
> > > Since 'rdpmc' isn't a serialising instruction they can still be
> > > outstanding when the function returns.
> > > Uncomment the 'rdtsc' and you get much slower values for short buffers.
>
> Maybe add an `lfence` before / after `csum64`

That's probably less strong than rdtsc, I might try it.
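
A rough sketch of what fencing the measurement could look like, under
some assumptions: it reads the TSC via the __rdtsc() intrinsic as a
stand-in for rdpmc (which needs the counter set up first), the plain
add loop stands in for csum64, and fenced_timestamp/timed_sum are
made-up names. On Intel cpus lfence waits for earlier instructions to
complete before later ones start, so the adds can't still be in flight
when the second timestamp is read.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <x86intrin.h>
#else
#include <time.h>
#endif

/* Timestamp with an lfence on each side of the counter read. */
static inline uint64_t fenced_timestamp(void)
{
#if defined(__x86_64__)
    _mm_lfence();               /* earlier instructions finish first */
    uint64_t t = __rdtsc();
    _mm_lfence();               /* later instructions wait for the read */
    return t;
#else
    return (uint64_t)clock();   /* coarse fallback for other architectures */
#endif
}

/* Sum the buffer and report the elapsed count for the loop alone. */
uint64_t timed_sum(const uint64_t *buf, size_t nwords, uint64_t *elapsed)
{
    uint64_t start = fenced_timestamp();
    uint64_t sum = 0;

    for (size_t i = 0; i < nwords; i++)
        sum += buf[i];

    uint64_t end = fenced_timestamp();

    *elapsed = end - start;
    return sum;
}
```

The fences add a fixed cost to every measurement, but that cancels when
subtracting the 64 byte time from the 128 byte time.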

David
