Re: [PATCH] Rmap speedup

From: Andrew Morton (
Date: Wed Aug 07 2002 - 14:40:59 EST

Daniel Phillips wrote:
> ...
> > This patch _has_ to help, even if not on my box. Plus it's gorgeous ;) See
> > how it minimises internal fragmentation and cache misses at the same time.
> It is very nice, and short. It would be a shame if it didn't actually
> accomplish anything. Have you got any more data on that?

Nope. Just warm and fuzzies.

> What stands out for me is that rmap is now apparently at parity with
> (virtual scanning) 2.4.19 for a real application, i.e., parallel make.
> I'm sure we'll still see the raw setup/teardown overhead if we make a
> point of going looking for it, but it would be weird if we didn't.
> So can we tentatively declare victory of the execution overhead issue?

err, no. It's still adding half a millisecond or so to each fork/exec/exit
cycle. And that is arising from, apparently, an extra two cache misses
per page. Doubt if we can take this much further.

> ...
> Vectoring up the pte chain nodes as
> you do here doesn't help much because the internal fragmentation
> roughly equals the reduction in link fields.

Are you sure about that? The vectoring is only a loss for very low
sharing levels, at which the space consumption isn't a problem anyway.
At high levels of sharing it's almost a halving.

> So coalescing is the next big evil problem to tackle. It occurs to me
> that defragging pte chain node pages is actually easy and local.

If we are forced to actively defragment pte_chain pages then
it's all just getting too much, surely.

> ...
> Anyway, I think we are within sight of running out of serious issues
> and now is the time to start testing the thing that rmap is supposed
> to improve, namely swap performance.
> Rmap is clearly going to help a lot with swapout on heavily shared
> pages by giving us active unmapping, so we don't have to leave large
> numbers of pages sitting around half in memory and half in swap. But
> rmap doesn't help a bit with that same nasty behaviour on swapin; we
> still have to wait patiently for every sharer to get around to
> faulting in the page, and in my mind, that makes it very easy to
> construct a test case that turns all of swap into a useless appendage,
> full of half swapped-in pages.

Is it useful to instantiate the swapped-in page into everyone's
pagetables, save some minor faults?

> > Should we count PageDirect rmaps in /proc/meminfo:ReverseMaps?
> > I chose not to, so we can compare that number with the slabinfo
> > for pte_chains and see how much memory is being wasted. Plus the
> > PageDirect rmaps aren't very interesting, because they don't consume
> > any resources.
> Agreed. It would be nice to have a stat for total ptes mapped, and we
> can see how total pte chain nodes compares to that. These two stats
> will allow users to form an accurate impression of just how much memory
> is being eaten by pte chains.

This can be determined from a bit of math on
/proc/meminfo:ReverseMaps and /proc/slabinfo:pte_chains
> ...
> > [...]
> >
> > + * - If a page has a pte_chain list then it is shared by at least two processes,
> > + * because a single sharing uses PageDirect. (Well, this isn't true yet,
> > + * coz this code doesn't collapse singletons back to PageDirect on the remove
> > + * path).
> That's not necessarily such a bad thing. Consider what happens with bash:
> first a bunch of do_no_page events cause the direct links to be created as
> bash faults in the various mmaps it uses, then it forks to exec a command,
> causing all the direct links to turn into pte chain nodes, then the command
> execs. If you collapse all the pte chain nodes back to direct links at
> this point they're just going to be created again on the next fork/exec,
> in other words, lots of extra churning.

> So I'd suggest just forgetting about the collapse-to-direct feature. With
> pte chain coalescing we'll eventually get the memory back.

I implemented Rik's suggestion. The collapse back to direct representation
happens in page_referenced(). So it happens only in response to page reclaim
activity, when the system needs memory. Nice, yes?

> > [...]
> >
> > + * - Insertion into the pte_chain puts a pte pointer in the last free slot of
> > + * the head member.
> I don't know why you decided to fill in from top to bottom but I suppose it
> scarcely makes a difference. You might want to run this one backwards and
> take an early exit on null:
> > + for (i = 0; i < NRPTE; i++) {
> > + pte_t *p = pc->ptes[i];
> > + if (p && ptep_test_and_clear_young(p))
> > + referenced++;
> > + }

Thanks, did that.

> >
> > [...]
> >
> > + * - Removal from a pte chain moves the head pte of the head member onto the
> > + * victim pte and frees the head member if it became empty.
> Thus randomizing the order and losing the pushdown effect, oh well.

Only a little, I expect. If the access pattern is mostly LIFO then we'll
still get good locality at the head of the chain.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at
Please read the FAQ at

This archive was generated by hypermail 2b29 : Wed Aug 07 2002 - 22:00:37 EST