Re: [PATCH] Rmap speedup

From: Andrew Morton (
Date: Sat Aug 03 2002 - 16:40:41 EST

Daniel Phillips wrote:
> On Saturday 03 August 2002 07:24, Andrew Morton wrote:
> > - page_add_rmap has vanished
> > - page_remove_rmap has halved (80% of the remaining is the
> > list walk)
> > - we've moved the cost into the new locking site, zap_pte_range
> > and copy_page_range.
> > So rmap locking is still a 15% slowdown on my soggy quad, which generally
> > seems relatively immune to locking costs.
> What is it about your quad?

Dunno. I was discussing related things with David M-T yesterday. A
`lock;incl' on a P4 takes ~120 cycles, everything in-cache. On my PIII
it's 33 cycles. 12-16 on ia64. Lots of variation there.

Instrumentation during your 35 second test shows:

- Performed an rmap lock 6.5M times
- Got a hit on the cached lock 9.2M times
- Got a miss on the cached lock 2.6M times.

So the remaining (6.5M - 2.6M) locks were presumably in the
pagefault handlers.

lockmeter results are at

- We took 3,000,000 rwlocks (mainly pagecache)
- We took 20,000,000 spinlocks
- the locking is spread around copy_mm, copy-page_range,
  do_no_page, handle_mm_fault, page_add_rmap, zap_pte_range
- total amount of CPU time lost spinning on locks is 1%, mainly
  in page_add_rmap and zap_pte_range.

That's not much spintime. The total system time with this test went
from 71 seconds (2.5.26) to 88 seconds (2.5.30). (4.5 seconds per CPU)
So all the time is presumably spent waiting on cachelines to come from
other CPUs, or from local L2.

lockmeter results for 2.5.26 are at

- 2.5.26 took 17,000,000 spinlocks
- but 3,000,000 of those were kmap_lock and pagemap_lru_lock, which
  have been slaughtered in my tree. rmap really added 6,000,000
  locks to 2.5.30.

Running the same test on 2.4:

        ./ 35.12s user 65.96s system 363% cpu 27.814 total
        ./ 35.95s user 64.77s system 362% cpu 27.763 total
        ./ 34.99s user 66.46s system 364% cpu 27.861 total

        ./ 36.20s user 106.80s system 363% cpu 39.316 total
        ./ 38.76s user 118.69s system 399% cpu 39.405 total
        ./ 35.47s user 106.90s system 364% cpu 39.062 total

2.4.19-pre7+rmap-13b+your patch:
        ./ 33.72s user 97.20s system 364% cpu 35.904 total
        ./ 35.18s user 94.48s system 363% cpu 35.690 total
        ./ 34.83s user 95.66s system 363% cpu 35.921 total

The system time is pretty gross, isn't it?

And it's disproportional to the increased number of lockings.

> ...
> But before we start on the micro-optimization we need to know why your quad
> is so unaffected by the big change.

We need a little test proggy to measure different platforms cache
load latency, locked operations, etc. I have various bits-n-pieces,
will put something together.

> Are you sure the slab cache batching of
> pte chain allocation performs as well as my simpleminded inline batching?

Slab is pretty good, I find. And there's no indication of a problem
in the profiles.

> (I batched the pte chain allocation lock quite nicely.) What about the bit
> test/set for the direct rmap pointer, how is performance affected by dropping
> the direct lookup optimization?

It didn't show in the instruction-level oprofiling.

> Note that you are holding the rmap lock
> considerably longer than I was, by holding it across __page_add_rmap instead
> of just across the few instructions where pointers are actually updated. I'm
> also wondering if gcc is optimizing your cached_rmap_lock inline as well as
> you think it is.
> I really need to be running on 2.5 so I can crosscheck your results. I'll
> return to the matter of getting the dac960 running now.

Sigh. No IDE?

> Miscellaneous question: we are apparently adding rmaps to reserved pages, why
> is that?

That's one for Rik...
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at
Please read the FAQ at

This archive was generated by hypermail 2b29 : Wed Aug 07 2002 - 22:00:22 EST