Re: CPA patchset

From: dean gaudet
Date: Fri Jan 11 2008 - 12:57:17 EST


On Fri, 11 Jan 2008, dean gaudet wrote:

> On Fri, 11 Jan 2008, Ingo Molnar wrote:
>
> > * Andi Kleen <ak@xxxxxxx> wrote:
> >
> > > Cached requires the cache line to be read first before you can write
> > > it.
> >
> > nonsense, and you should know it. It is perfectly possible to construct
> > fully written cachelines, without reading the cacheline first. MOVDQ is
> > SSE1 so on basically in every CPU today - and it is 16 byte aligned and
> > can generate full cacheline writes, _without_ filling in the cacheline
> > first.
>
> did you mean to write MOVNTPS above?

btw in case you were thinking of a normal store to WB memory rather than a
non-temporal store... i ran a microbenchmark streaming stores to every 16
bytes of a 16MiB region aligned to 4096 bytes on a xeon 53xx series CPU
(4MiB L2) + 5000X northbridge. the avg latency of MOVNTPS is 12 cycles
whereas the avg latency of MOVAPS is 20 cycles.

the inner loop is unrolled 16 times (16 stores x 16 bytes = 256 bytes) so
there are literally 4 cache lines worth of stores being stuffed into the
store queue as fast as possible... and there's no coalescing for normal
stores even on this modern CPU.
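
for reference, a minimal sketch of this kind of measurement. it is not the code behind the numbers above, which was not posted. the names alloc_region and avg_ticks_per_store are made up for illustration, and it assumes x86 with SSE, GCC or Clang, and 64-byte cache lines. __rdtsc counts TSC ticks, which on newer CPUs need not equal core clock cycles, and the compiler decides whether the fixed-count inner loop really gets unrolled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>  /* GCC/Clang: _mm_store_ps, _mm_stream_ps, _mm_sfence, __rdtsc */

#define REGION_BYTES     ((size_t)16 * 1024 * 1024) /* 16 MiB, larger than a 4 MiB L2 */
#define STORE_BYTES      ((size_t)16)               /* one 128-bit SSE store */
#define UNROLL           ((size_t)16)               /* 16 stores = 256 bytes = 4 lines of 64 bytes */
#define FLOATS_PER_STORE (STORE_BYTES / sizeof(float))

/* Hypothetical helper: 4096-byte aligned region, pre-touched so that
 * page faults are not included in the timed loop. */
static float *alloc_region(void)
{
    float *buf = aligned_alloc(4096, REGION_BYTES);
    if (buf != NULL)
        memset(buf, 0, REGION_BYTES);
    return buf;
}

/* Hypothetical helper: stores 16 bytes to every 16-byte slot of the region
 * and returns the average number of TSC ticks per store.
 * nontemporal != 0 uses _mm_stream_ps (MOVNTPS), otherwise _mm_store_ps (MOVAPS). */
static double avg_ticks_per_store(float *buf, int nontemporal)
{
    const __m128 v = _mm_set1_ps(1.0f);
    const size_t nfloats = REGION_BYTES / sizeof(float);
    const size_t chunk = UNROLL * FLOATS_PER_STORE;

    assert(((uintptr_t)buf % STORE_BYTES) == 0); /* both store forms need 16-byte alignment */

    uint64_t start = __rdtsc();
    for (size_t i = 0; i < nfloats; i += chunk) {
        /* fixed trip count of 16 stores; the compiler is free to unroll this */
        for (size_t j = 0; j < chunk; j += FLOATS_PER_STORE) {
            if (nontemporal)
                _mm_stream_ps(buf + i + j, v);
            else
                _mm_store_ps(buf + i + j, v);
        }
    }
    _mm_sfence(); /* make the non-temporal stores globally visible before stopping the clock */
    uint64_t end = __rdtsc();

    return (double)(end - start) / (double)(REGION_BYTES / STORE_BYTES);
}
```

calling avg_ticks_per_store(buf, 1) and avg_ticks_per_store(buf, 0) on a buffer from alloc_region() gives the two averages to compare. the absolute values depend on the CPU, the chipset and the compiler flags, so repeat each run several times before reading anything into the difference.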

i'm certain i'll see the same thing on AMD... it's a very hard thing to do
in hardware without the non-temporal hint.

-dean

