Re: page table lock patch V15 [0/7]: overview

From: Andi Kleen
Date: Thu Jan 13 2005 - 23:41:45 EST


On Thu, Jan 13, 2005 at 05:09:04PM -0800, Christoph Lameter wrote:
> On Thu, 13 Jan 2005, Andi Kleen wrote:
>
> > On Thu, Jan 13, 2005 at 09:11:29AM -0800, Christoph Lameter wrote:
> > > On Wed, 13 Jan 2005, Andi Kleen wrote:
> > >
> > > > Alternatively you can use a lazy load, checking for changes.
> > > > (untested)
> > > >
> > > > pte_t read_pte(volatile pte_t *pte)
> > > > {
> > > > 	pte_t n;
> > > > 	do {
> > > > 		n.pte_low = pte->pte_low;
> > > > 		rmb();
> > > > 		n.pte_high = pte->pte_high;
> > > > 		rmb();
> > > > 	} while (n.pte_low != pte->pte_low);
> > > > 	return n;
> > > > }
>
> I think this is not necessary. Most IA32 processors do 64
> bit operations in an atomic way in the same way as IA64. We can cut out
> all the stuff we put in to simulate 64 bit atomicity for i386 PAE mode if
> we just convince the compiler to use 64 bit fetches and stores. 486

That would mean either cmpxchg8b (slow) or using MMX/SSE (even slower
because you would need to save the FPU state and disable
exceptions).
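
For comparison, here is roughly what a cmpxchg8b based read looks like
(a sketch only, not necessarily the code behind the readpte_cmp numbers
below; it assumes the usual PAE pte_t with pte_low/pte_high halves, and
read_pte_cmpxchg is just a name picked for illustration). cmpxchg8b
always leaves the current 64 bit value in edx:eax, so issuing it with
old == new gives an atomic read, at the price of a locked cycle on the
line:

static inline pte_t read_pte_cmpxchg(pte_t *ptep)
{
	pte_t res = { 0, 0 };

	/* edx:eax is the compare value, ecx:ebx the replacement; keeping
	 * them equal means the pte value is never changed, and edx:eax
	 * always ends up holding the current 64 bit pte. */
	asm volatile("lock; cmpxchg8b %2"
		     : "+a" (res.pte_low), "+d" (res.pte_high), "+m" (*ptep)
		     : "b" (0), "c" (0)
		     : "cc");
	return res;
}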

I think the FPU is far too slow and complicated. I benchmarked the lazy read
and cmpxchg8b:

Athlon64:
readpte hot 42
readpte cold 426
readpte_cmp hot 33
readpte_cmp cold 2693

Nocona:
readpte hot 140
readpte cold 960
readpte_cmp hot 48
readpte_cmp cold 2668

As you can see, cmpxchg is slightly faster for the cache hot case,
but incredibly slow for the cache cold case (probably because it does
something nasty on the bus). This is pretty consistent across Intel and
AMD CPUs. Given that page tables are likely more often cache cold than
hot, I would use the lazy variant.
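
(For reference, hot/cold numbers of this kind can be reproduced with a
small user space harness along the following lines. This is only a
sketch of one way to measure it, not the harness used for the table
above; it assumes rdtsc cycle counts, uses clflush (SSE2) to force the
cache cold case, and substitutes a compiler barrier for rmb(), which is
sufficient for load ordering on x86.)

#include <stdio.h>
#include <stdint.h>

typedef struct { uint32_t pte_low, pte_high; } pte_t;

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

/* the lazy read quoted above, rmb() replaced by a compiler barrier */
static pte_t read_pte(volatile pte_t *pte)
{
	pte_t n;
	do {
		n.pte_low = pte->pte_low;
		asm volatile("" ::: "memory");
		n.pte_high = pte->pte_high;
		asm volatile("" ::: "memory");
	} while (n.pte_low != pte->pte_low);
	return n;
}

int main(void)
{
	static pte_t pte = { 0x1234, 0x5678 };
	uint64_t t0, hot, cold;
	pte_t v;

	/* hot: the line is already in the cache from the warm-up read */
	read_pte(&pte);
	t0 = rdtsc();
	v = read_pte(&pte);
	hot = rdtsc() - t0;

	/* cold: flush the line out of the cache, then time a single read */
	asm volatile("clflush (%0)" :: "r" (&pte) : "memory");
	asm volatile("mfence" ::: "memory");
	t0 = rdtsc();
	v = read_pte(&pte);
	cold = rdtsc() - t0;

	printf("readpte hot %llu cold %llu (pte %x:%x)\n",
	       (unsigned long long)hot, (unsigned long long)cold,
	       v.pte_high, v.pte_low);
	return 0;
}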

-Andi

