Re: page fault scalability patch V11 [0/7]: overview

From: William Lee Irwin III
Date: Sat Nov 20 2004 - 03:02:37 EST


Andrew Morton wrote:
>> Given that we have contention problems updating a single mm-wide rss and
>> given that the way to fix that up is to spread things out a bit, it seems
>> wildly arbitrary to me that the way in which we choose to spread the
>> counter out is to stick a bit of it into each task_struct.
>> I'd expect that just shoving a pointer into mm_struct which points at a
>> dynamically allocated array[NR_CPUS] of longs would suffice. We probably
>> don't even need to spread them out on cachelines - having four or eight
>> cpus sharing the same cacheline probably isn't going to hurt much.
>> At least, that'd be my first attempt. If it's still not good enough, try
>> something else.

On Sat, Nov 20, 2004 at 06:13:03PM +1100, Nick Piggin wrote:
> That is what Bill thought too. I guess per-cpu and per-thread rss are
> the leading candidates.
> Per thread rss has the benefits of cacheline exclusivity, and not
> causing task bloat in the common case.
> Per CPU array has better worst case /proc properties, but shares
> cachelines (or not, if using percpu_counter as you suggested).
> I think I'd better leave it to others to finish off the arguments ;)

(1) The "task bloat" is more than tolerable on the systems capable
of having enough cpus to see significant per-process
memory footprint, where "significant" is smaller than a
pagetable page even for systems twice as large as now shipped.
(2) The cacheline exclusivity is not entirely gone in dense per-cpu
arrays, it's merely "approximated" by sharing amongst small
groups of adjacent cpus. This is fine for e.g. NUMA because
those small groups of adjacent cpus will typically be on nearby
nodes.
(3) The price paid to get "perfect exclusivity" instead of "approximate
exclusivity" is unbounded tasklist_lock hold time, which takes
boxen down outright in every known instance.
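
For concreteness, the dense per-cpu array Andrew describes comes out
roughly as follows (a sketch only; the rss_percpu field and the
mm_rss_add()/mm_rss_read() helpers are names I made up, not code from
any posted patch):

	struct mm_struct {
		/* ... existing fields ... */
		long *rss_percpu;	/* kmalloc()'d array[NR_CPUS],
					 * deliberately not padded out
					 * to cacheline size */
	};

	/* Fault-path writers touch only their own cpu's slot. */
	static inline void mm_rss_add(struct mm_struct *mm, long delta)
	{
		int cpu = get_cpu();	/* disable preemption */
		mm->rss_percpu[cpu] += delta;
		put_cpu();
	}

	/* Readers sum all slots: O(NR_CPUS), no tasklist_lock. */
	static inline long mm_rss_read(struct mm_struct *mm)
	{
		long sum = 0;
		int cpu;

		for (cpu = 0; cpu < NR_CPUS; cpu++)
			sum += mm->rss_percpu[cpu];
		return sum;
	}

For scale: at sizeof(long) == 8, a 256-cpu machine pays 2KB per mm
for the array, half of a 4KB pagetable page.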

The "worst case" properties are not about /proc at all; they are about
tasklist_lock. Every read-side holder stops all writers. When you hold
tasklist_lock for an extended period, whether for read or write (e.g.
an exhaustive tasklist search), you stop all fork()s, exit()s, and
execve()s on a running system. The "worst case" analysis has nothing
to do with speed. It has everything to do with taking a box down
outright, much like unplugging power cables or dereferencing NULL.
Unbounded tasklist_lock hold time kills running boxen dead.
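
To make the failure mode concrete, the per-thread scheme's read side
has to look something like the following (a sketch; the per-task
counter field and the function name are hypothetical):

	/*
	 * Sum the per-thread counters across every thread sharing the
	 * mm.  The walk runs with tasklist_lock held for read, so the
	 * hold time is unbounded in the number of tasks, and for its
	 * whole duration every fork()/exit() writer is stalled.
	 */
	static long mm_rss_read_perthread(struct mm_struct *mm)
	{
		struct task_struct *g, *p;
		long sum = 0;

		read_lock(&tasklist_lock);
		do_each_thread(g, p) {
			if (p->mm == mm)
				sum += p->rss;	/* hypothetical field */
		} while_each_thread(g, p);
		read_unlock(&tasklist_lock);
		return sum;
	}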

Read sides of rwlocks are not licenses to spin for aeons with locks held.

And the "question" of sufficiency has in fact already been answered.
SGI's own testing during the 2.4 out-of-tree patching cycle determined
that an mm-global atomic counter was already sufficient so long as the
cacheline was not shared with ->mmap_sem and the like. The "simplest"
optimization of moving the field out of the way of ->mmap_sem already
worked. The grander ones, if and *ONLY* if they don't have showstoppers
like unbounded tasklist_lock hold time or castrating workload monitoring
to unusability, will merely be more robust for future systems.
Reiterating, this is all just fine so long as they don't cause any
showstopping problems, like castrating the ability to monitor
processes, or introducing more tasklist_lock starvation.
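
For reference, the "simplest" optimization is nothing more than a
layout change, along these lines (illustrative only, not the actual
SGI patch; the atomic_t type and field placement are my guesses):

	/*
	 * Keep one mm-global counter, but give it its own cacheline
	 * so fault-path rss updates stop ping-ponging the line that
	 * carries ->mmap_sem.
	 */
	struct mm_struct {
		struct rw_semaphore mmap_sem;
		/* ... other hot fields ... */
		atomic_t rss ____cacheline_aligned_in_smp;
	};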


-- wli