Re: CPU swapping on Linux

From: James Manning (jmm@raleigh.ibm.com)
Date: Mon Jan 24 2000 - 00:36:24 EST


Warning: very low on sleep ... and rambling! :)

[ Sunday, January 23, 2000 ] Rik van Riel wrote:
> Indeed, you're completely right. This is a bug in the kernel
> that explains some strange results I've even been able to see
> on my dual P120...

[snip]

> > in arch/i386/kernel/smpboot.c:
> >
> > cacheflush_time = cpu_hz/1024*cachesize/5000;
> >
> > which, if I'm reading it right, assumes ~5 MB/s dram bandwidth.
> > the calculation should be
> >
> > cacheflushInCycles = cpuHz * cachesizeBytes / bytesPerSecond;

If it were guessing that the DRAM bandwidth was very low, shouldn't it
have been grossly over-estimating cacheflush_time rather than
under-estimating it?

It looks like cachesize, since it comes from x86_cache_size, is the
cache size in KB, not bytes. Hence the original, with some rearranging
(keeping the math the same), looks like:

cacheflush_time = cpu_hz * cachesize / (5000*1024)

Multiplying the numerator and denominator by 1024 to get cachesize into
bytes, you're looking at (cpu_hz * cachesizeInBytes / 5GB), so it looks
like the original formula grossly overestimates our pipe out to
main memory; hence the flush time gets so little weight that
processes hop all around :)

(if I had only 5 MB/sec to memory, I wouldn't waste a single cache line! :)
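
To put made-up numbers on it (say a 500 MHz box with a 512 KB L2;
quick userspace toy, not kernel code):

	/* back-of-envelope check of the bandwidth the original
	 * formula implies; all CPU numbers invented */
	#include <stdio.h>

	int main(void)
	{
		unsigned long long cpu_hz    = 500000000ULL; /* 500 MHz */
		unsigned long long cachesize = 512; /* KB, like x86_cache_size */

		/* the original: cpu_hz/1024*cachesize/5000 */
		unsigned long long cycles  = cpu_hz / 1024 * cachesize / 5000;
		/* implied bandwidth = bytes / (cycles / cpu_hz) */
		unsigned long long implied = cpu_hz * (cachesize * 1024) / cycles;

		printf("cacheflush_time: %llu cycles (~%llu usec)\n",
		       cycles, cycles * 1000000 / cpu_hz);
		printf("implied bandwidth: ~%llu MB/s\n", /* ~5243 MB/s */
		       implied / 1000000);
		return 0;
	}

So a 512 KB flush is being costed at ~100 usec, i.e. a ~5 GB/s pipe to
memory, a good order of magnitude more optimistic than the 350 MB/s
figure suggested above, which is exactly why the fast exit never fires.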

> > a better calculation would be:
> >
> > static const unsigned bandwidth = 350000000;
> > cacheflush_time = cpu_hz * (cachesize*1024) / bandwidth;

Whatever formula ends up in place, I'd like to see
 - a 3 or 4 line explanation of why this formula makes sense
 - all constants as #define's (rather than const's, simply because I
   could see them migrating out to include/asm-*/*) with good names :)
   ... see the sketch right below for the shape I mean
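
Something in this direction, maybe; the constant's name and value are
hypothetical here, and I've reordered the math so the intermediate
result can't overflow an unsigned long on 32-bit boxes:

	/* hypothetical name/value: sustained bandwidth out to main
	 * memory in bytes/sec; could later migrate into the
	 * per-arch include dirs */
	#define CACHE_FLUSH_BANDWIDTH 350000000

	/*
	 * Cycles to write the whole L2 back out:
	 *   cycles = cpu_hz * bytes / (bytes/sec)
	 * Dividing first (bandwidth over cachesize-in-bytes gives
	 * flushes-per-second) keeps everything within 32 bits.
	 */
	cacheflush_time = cpu_hz /
		(CACHE_FLUSH_BANDWIDTH / (cachesize * 1024));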
 
Keep in mind, also, that all the way-too-small cacheflush_time did
was keep us from ever taking that fast-exit-doing-nothing out of
reschedule_idle, and *only* on SMP (cacheflush_time isn't even considered
in UP). While this may seem bad, it's very close to the typical case,
since the very-frequent reschedulers that this check is designed to
protect against are still the uncommon case.

So each reschedule_idle() on SMP would scan for an idle CPU and,
failing that, scan for preemption opportunities, first on ourselves
and then on all CPUs... those scans are all we're saving (and not
really "saving" per se, since the fast exit doesn't really
get any real work done aside from getting out of the way of those
frequent reschedulers).
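
In other words, roughly this shape (a standalone toy of the flow as I
read it; the helper names are made up, this isn't the actual source):

	/* toy model of the fast exit in reschedule_idle();
	 * names invented, only the control flow matters */
	#include <stdio.h>

	static unsigned long cacheflush_time;	/* cycles, set at boot */

	/* a task whose average slice is shorter than a full cache
	 * flush is one of those very-frequent reschedulers */
	static int frequent_rescheduler(unsigned long avg_slice)
	{
		return avg_slice < cacheflush_time;
	}

	static void reschedule_idle_sketch(unsigned long avg_slice)
	{
		if (frequent_rescheduler(avg_slice))
			return;		/* the fast-exit-doing-nothing */

		/* otherwise: scan all CPUs for an idle one, and
		 * failing that, scan for a preemption victim,
		 * our own CPU first, then everyone else's */
		printf("scanning run queues\n");
	}

	int main(void)
	{
		cacheflush_time = 49999;	 /* way too small */
		reschedule_idle_sketch(500000);	 /* so we always scan */
		return 0;
	}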

Warning: if we reverse the cacheflush_time error by a few orders
of magnitude in the opposite direction, cacheflush_time
gets so large that we (almost :) always take the fast exit, and
interactive responsiveness will plummet as time slices get
huge (I'm not even sure they'd stay limited to 200ms) and CPU
hogs never get preempted (well, not as early as they should :)
since we skip the preemption scans such a large % of the time.
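
For scale, on the same made-up 500 MHz / 512 KB box as above: if the
constant really did encode 5 MB/s, each flush would be costed at

	cacheflush_time = 500e6 cycles/sec * 512 KB / 5 MB/s
			 ~= 52,000,000 cycles ~= 105 ms

which is already half of a whole 200ms slice.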

> > tweaking of PROC_CHANGE_PENALTY and p->mm==this_mm weights might
> > help too.
>
> Definitely. Maybe we want to calculate PROC_CHANGE_PENALTY
> and/or the TLB_FLUSH weight on boottime? We don't have to
> worry much about the extra cache access that will give because
> we'll either have a small runqueue or we'll use those two
> values over and over again :)

I totally agree with this idea (hey, I made the patch, I might as
well cheerlead :) In the "typical" case (I'm looking forward to getting
a better idea of what % of the time this fast exit should be taken on
real systems) we're either finding an idle task (no-brainer) or looking
for a preemption possibility. Unfortunately, as it sits, that preemption
decision/heuristic is static at compile time. It *is* necessary,
though, to keep goodness() itself as fast as possible, given its high
call count, especially with deep run queues.

My recent patch exposes the processor change penalty and the TLB flush
penalty (the TLB flush gets considered in the UP case, something
cacheflush_time doesn't currently have going for it) as dynamically
settable /proc entries, so we can make very effective scheduling
policy changes with these two variables; between the two of them,
both SMP and UP systems can have their whole preemption heuristics
redone.
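
For the /proc side, the shape I have in mind is just a couple of
sysctl table entries, something like this (entry names, numbers, and
defaults are hypothetical; the actual patch may differ):

	/* hypothetical tunables, in the style of kernel/sysctl.c */
	int proc_change_penalty = 20;	/* made-up default */
	int tlb_flush_penalty   = 10;	/* made-up default */

	static ctl_table sched_table[] = {
		{1, "proc_change_penalty", &proc_change_penalty,
		 sizeof(int), 0644, NULL, &proc_dointvec},
		{2, "tlb_flush_penalty", &tlb_flush_penalty,
		 sizeof(int), 0644, NULL, &proc_dointvec},
		{0}
	};

After that, retuning a live box is just an echo into the matching
/proc/sys/ file, no recompile or reboot needed.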

Ultimately, I'd love to see this become Andrea's auto-tuning
idea, but I'm still grasping for how we can be fair, fast, and
adaptive to a changing workload mix... That's the long-term
goal, though, and this first step should help make scheduling
a bit more flexible.

I'm looking forward to hearing back results... if nothing else,
to see whether the "optimal" points for 2-way vs. 8-way
systems are as far apart as I imagine, or if the numbers
are all in a small range.

It's a short patch (against .41-pre2) but I didn't get much sleep
so give it a once-over after applying (or before :) and smack
me upside the head if I broke something badly :)

James

-- 
Miscellaneous Engineer --- IBM Netfinity Performance Development



