Re: [PATCH] x86: Run checksumming in parallel across multiple ALUs

From: H. Peter Anvin
Date: Thu Oct 17 2013 - 14:21:01 EST


On 10/17/2013 01:41 AM, Ingo Molnar wrote:
>
> To correctly simulate the workload you'd have to:
>
> - allocate a buffer larger than your L2 cache.
>
> - to measure the effects of the prefetches you'd also have to randomize
> the individual buffer positions. See how 'perf bench numa' implements a
> random walk via --data_rand_walk, in tools/perf/bench/numa.c.
> Otherwise the CPU might learn your simplistic stream direction and the
> L2 cache might hw-prefetch your data, interfering with any explicit
> prefetches the code does. In many real-life use cases packet buffers are
> scattered.
>
> Also, it would be nice to see standard deviation noise numbers when two
> averages are close to each other, to be able to tell whether differences
> are statistically significant or not.
>

Seriously, though, how much does it matter? All that the above seems
likely to do is drown the signal by adding noise.

If the parallel (threaded) checksumming is faster, which theory says it
should and microbenchmarking confirms, how important are the
macrobenchmarks?

-hpa
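
For readers following the thread, the measurement setup described in the quoted text (a buffer larger than L2, randomized buffer positions, and standard deviation next to the averages) might be sketched roughly as below. This is an illustration only, not code from the patch or from tools/perf/bench/numa.c; the helper names make_random_walk() and mean_stddev(), the xorshift64 generator, and the chunk-based layout are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift64 PRNG so the walk is reproducible; state must be non-zero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;

    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/*
 * Hypothetical helper: fill offs[0..n-1] with the offsets 0, chunk,
 * 2*chunk, ... and shuffle them (Fisher-Yates), so that successive
 * checksum calls touch the buffer in an order the hardware prefetcher
 * cannot easily learn.
 */
static void make_random_walk(size_t *offs, size_t n, size_t chunk,
                             uint64_t seed)
{
    size_t i;

    assert(seed != 0);
    for (i = 0; i < n; i++)
        offs[i] = i * chunk;
    for (i = n; i > 1; i--) {
        size_t j = (size_t)(xorshift64(&seed) % i);
        size_t tmp = offs[i - 1];

        offs[i - 1] = offs[j];
        offs[j] = tmp;
    }
}

/*
 * Hypothetical helper: mean and sample standard deviation of n timing
 * samples, accumulated with Welford's method.
 */
static void mean_stddev(const double *x, size_t n, double *mean, double *sd)
{
    double m = 0.0, m2 = 0.0, var, r;
    size_t i;
    int k;

    assert(n >= 2);
    for (i = 0; i < n; i++) {
        double d = x[i] - m;

        m += d / (double)(i + 1);
        m2 += d * (x[i] - m);
    }
    var = m2 / (double)(n - 1);

    /* Newton iteration for sqrt(var), used here only to avoid libm. */
    r = var > 1.0 ? var : 1.0;
    for (k = 0; k < 60; k++)
        r = 0.5 * (r + var / r);

    *mean = m;
    *sd = (var == 0.0) ? 0.0 : r;
}
```

A harness would allocate n * chunk bytes with the total chosen larger than the L2 cache of the machine under test (the size is machine-specific and not given in the thread), time the checksum routine at each offs[i], and report the mean together with the standard deviation for each variant being compared.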

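The claim that theory favours the parallel version can be illustrated with a portable sketch: a one's-complement sum accumulated through a single dependency chain versus two independent chains that a superscalar CPU may execute on separate ALUs. This is not the patch's x86 implementation; the function names, the 32-bit word loads, and the requirement that len be a multiple of 4 are assumptions made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a 64-bit accumulator down to a 16-bit one's-complement sum. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Baseline: every addition depends on the result of the previous one. */
static uint16_t csum_one_chain(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t sum = 0;
    uint32_t w;
    size_t i;

    assert(len % 4 == 0);       /* simplifying assumption of this sketch */
    for (i = 0; i < len; i += 4) {
        memcpy(&w, p + i, 4);   /* alignment-safe 32-bit load */
        sum += w;
    }
    return fold64(sum);
}

/*
 * Two accumulators: the additions into a and b do not depend on each
 * other, so the CPU is free to issue them in parallel.
 */
static uint16_t csum_two_chains(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t a = 0, b = 0;
    uint32_t w0, w1;
    size_t i = 0;

    assert(len % 4 == 0);
    for (; i + 8 <= len; i += 8) {
        memcpy(&w0, p + i, 4);
        memcpy(&w1, p + i + 4, 4);
        a += w0;
        b += w1;
    }
    if (i < len) {              /* one leftover 32-bit word */
        memcpy(&w0, p + i, 4);
        a += w0;
    }
    /* a + b cannot wrap for any realistic packet or buffer length. */
    return fold64(a + b);
}
```

Both functions return the same folded sum for the same input, since addition is associative; only the dependency structure of the loop differs. Whether the two-chain form is actually faster on a given CPU and workload is exactly what the benchmarks discussed in this thread are meant to establish.
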
