Re: [RFC][PATCH][2.6.6] Replacing CPU scheduler active and expiredwith a single array

From: Peter Williams
Date: Fri Jun 04 2004 - 02:43:17 EST

Next message: Rick Jansen: "DriveReady SeekComplete Error"
Previous message: Russell King: "Re: Dell TrueMobile 1150 PCMCIA/Orinoco/Yenta problem w/ 2.6.4/5"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Andi Kleen wrote:

Con Kolivas <kernel@xxxxxxxxxxx> writes:

I think your aims of simplifying the scheduler are admirable but I hope you don't suffer the quagmire that is manipulating the interactivity stuff. Changing one value and saying it has no apparent effect is almost certainly wrong; surely it was put there for a reason - or rather I put it there for a reason.

But that doesn't mean that the reason cannot be reevaluated later.
If Peter can up with a simpler scheduler and nobody can break it significantly
it would be great, and i'm hope such simplifications could be merged
after testing. Certainly the current one does far too much black magic.

As a result of your encouragement I have implemented a simplified version of the interactive bonus scheme and an extension whose aim is to improve general system throughput on to of my SPA patch (to the 2.6.6 kernel) an updated version of which is available at:

<http://users.bigpond.net.au/Peter-Williams/patch-2.6.6-spa-v2>

Interactive Bonus (SPA_IA_BONUS) patch is available at:

<http://users.bigpond.net.au/Peter-Williams/patch-2.6.6-spa_ia_bonus>

This patch approaches the problem from a control systems perspective and the statistics are calculated for each task (using extremely simple and efficient Kalman filters):

A_c which is the running average number of nanoseconds of CPU time that it spends on the CPU during a scheduling cycle (SC) which is defined to be the period from one activation (or wake up) to the next.

A_s which is the running average number of nanoseconds that it spends sleeping during the SC

A_d which is the running average number of nanoseconds that it spends on a run queue waiting for CPU access

These statistics are then used to make the following tests about the task:

1. is it showing interactive behaviour, and/or
2. is it showing CPU bound behaviour.

The first test consists of testing that A_s / (A_s + A_c) (where A_s is the average sleep time per cycle and A_c is the average CPU time per cycle) is greater than a threshold (currently 90%) (NB this is equivalent to A_c / (A_c + A_s) being less than 10%) and also that the average sleep time isn't greater than a threshold (currently set at 15 minutes). This test is applied at the end of each scheduling cycle when the task is woken and put back on the run queue. If the task passes this test its interactive bonus is increased asymptotically towards the maximum bonus (or, at least, the maximum multiplied by their average A_s / (A_s + A_c)).

The second test is applied at the end of a cycle if the above test fails and also at the end of each time slice (if the task stays active for more than one time slice). This task consists of testing whether the ratio A_c / (A_s + A_c + A_d) (where A_d is the average delay on run queues waiting for CPU access per cycle) is greater a threshold (currently 50%) or A_s is zero or very large (currently greater than 2 hours) and if the task passes this test it is considered to be CPU bound or a chronic idler and NOT interactive and its interactive bonus is decreased asymptotically towards zero. The reason that A_c / (A_s + A_c + A_d) instead of the equivalent A_c / (A_c + A_s) form the first test is so that interactive tasks are not mistakenly classified as CPU bound tasks when the system is very busy and their wait time becomes squeezed and turns into run queue delay time. Similarly, the reason that A_c / (A_c + A_s + A_d) isn't used in the first test is that it could lead to CPU bound tasks being wrongly classed as interactive tasks when the system is very busy.

In the description of the first test, I mention that the interactive bonus of a tasks asymptotically approaches the product of the maximum bonus and their average A_s / (A_s + A_c). This has the important side effect that the interactive bonus of a fairly busy task such as the X server doesn't get as big bonus as that of those low CPU usage stream servers such as xmms and they work quite happily under extremely heavy loads including heavy X server activity.

Although this may sound complex and expensive it's actually quite cheap as estimating the averages is very simple and the more complex bits (involving 64 bit division to calculate the ratios) occur relatively infrequently.

Throughput Bonus (SPA_TPT_BONUS) patch is available at:

<http://users.bigpond.net.au/Peter-Williams/patch-2.6.6-spa_tpt_bonus>

This bonus is also based on the statistics described above and awards bonus points to tasks that are spending a disproportionate amount of time (compared to the CPU time they are using) on run queues. The maximum amount of this bonus is half that of the maximum interactive bonus so it should not cause tasks receiving it to interfere with them.
The bonus awarded is proportional to the equation: (A_d / (A_d + A_c * N)) where N is the number of tasks running on the CPU in question. This latter correction is necessary to stop all tasks getting the maximum bonus when the system is busy. Unlike the interactive bonus this bonus is not persistent and is recalculated every time the task is to be requeued. When the system is not heavily loaded this bonus has the tendency to reduce the overall amount of queuing on the system and improve throughput. When the system is heavily loaded it tends to spread the pain evenly among the non interactive tasks (subject, of course, to niceness).

Hopefully my explanations above don't fall into the "black magic" category and the mechanics of the scheduler are easy to understand? :-)

If not and you have any questions don't hesitate to ask
Peter
PS Patches for 2.6.7-rc2 should be available next week.
--
Dr Peter Williams pwil3058@xxxxxxxxxxxxxx

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Rick Jansen: "DriveReady SeekComplete Error"
Previous message: Russell King: "Re: Dell TrueMobile 1150 PCMCIA/Orinoco/Yenta problem w/ 2.6.4/5"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]