RE: [PATCH] sched: loadavg 0.00, 0.01, 0.05 on idle

From: Doug Smythies
Date: Thu Jan 21 2016 - 13:47:22 EST


On 2016.01.21 07:29 Peter Zijlstra wrote:
> On Thu, Jan 21, 2016 at 10:23:25AM +0100, Vik Heyndrickx wrote:
>> Systems show a minimal load average of 0.00, 0.01, 0.05 even when they have
>> no load at all.
>> ---
>> Subject: sched: Fix non-zero idle loadavg
>> From: Vik Heyndrickx <vik.heyndrickx@xxxxxxxxxxx>
>> Date: Thu, 21 Jan 2016 10:23:25 +0100

>> Systems show a minimal load average of 0.00, 0.01, 0.05 even when they
>> have no load at all.

>> By removing the single code line that performed a rounding on the
>> internally kept load value, effectively returning this function
>> calc_load to its state it had before, the visualization problem is
>> completely fixed.

Yes, but it introduces a systematic error in place of the current
balanced error, and so doubles the maximum error caused by the finite
number of bits used in the math.

>> Once the (old) load becomes 93 or higher, it mathematically can never
>> get lower than 93, even when the active (load) remains 0 forever.
>> This results in the strange 0.00, 0.01, 0.05 uptime values on idle
>> systems. Note: 93/2048 = 0.0454..., which rounds up to 0.05.

As I mentioned on the bug report [1], this is a consequence
of carrying a finite number of bits with such a strong
IIR (Infinite Impulse Response) filter coefficient.
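
The idle-side floor can be checked without booting a kernel. The following
is a minimal sketch in Python rather than the kernel's C, assuming calc_load
has the multiply, add-half, shift form discussed in this thread, with
FIXED_1 = 2048 and the 15 minute coefficient 2037. The `rounding` flag and
the `settle` helper are illustrative names, not kernel identifiers.

```python
FSHIFT = 11
FIXED_1 = 1 << FSHIFT      # 2048, i.e. 1.0 in fixed point
EXP_15 = 2037              # 15 minute decay coefficient from the thread


def calc_load(load, exp, active, rounding=True):
    """One filter step in integer math (assumed form, see lead-in)."""
    load = load * exp + active * (FIXED_1 - exp)
    if rounding:
        load += 1 << (FSHIFT - 1)   # the +0.5 rounding term under discussion
    return load >> FSHIFT           # truncating divide by 2048


def settle(load, exp, active, rounding=True):
    """Repeat calc_load until the value stops changing."""
    while True:
        new = calc_load(load, exp, active, rounding)
        if new == load:
            return load
        load = new


# Start at a load of 1.0 and let the system sit idle (active = 0).
print(settle(FIXED_1, EXP_15, 0, rounding=True))    # 93
print(settle(FIXED_1, EXP_15, 0, rounding=False))   # 0
print(round(93 / FIXED_1, 2))                       # 0.05
```

With rounding the idle value stops at 93; without it the value decays all
the way to 0, which is the behaviour the patch is after.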

>> It is not correct to add a 0.5 rounding (=1024/2048) here, since the
>> result from this function is fed back into the next iteration again,
>> so the result of that +0.5 rounding value then gets multiplied by
>> (2048-2037), and then rounded again, so there is a virtual "ghost"
>> load created, next to the old and active load terms.

If you do not round, then the problem doubles on the load-increasing
side of things. Consider an old load value of 1862 (90.92%),
regardless of how it got there, and a new load value of 2048 (100%)
from here onwards. With the proposed change, the 15 minute math becomes:

new = (old * 2037 + load * (2048 - 2037)) / 2048
new = (1862 * 2037 + 2048 * (2048 - 2037)) / 2048
new = 3815422 / 2048
new = 1862 (1862.999..., truncated by the integer divide)

So a 100% load will always be shown as 91%, an error of about 9%, which
is double the roughly 4.5% limit with rounding.
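
The rising side can be checked the same way. Again this is only a sketch in
Python under the same assumptions about the form of calc_load; the
`rounding` flag and the `settle` helper are illustrative names, not kernel
identifiers.

```python
FSHIFT = 11
FIXED_1 = 1 << FSHIFT      # 2048, i.e. 100% load in fixed point
EXP_15 = 2037              # 15 minute decay coefficient from the thread


def calc_load(load, exp, active, rounding):
    """One filter step in integer math (assumed form, see lead-in)."""
    load = load * exp + active * (FIXED_1 - exp)
    if rounding:
        load += 1 << (FSHIFT - 1)   # the +0.5 rounding term
    return load >> FSHIFT           # truncating divide by 2048


def settle(load, exp, active, rounding):
    """Repeat calc_load until the value stops changing."""
    while True:
        new = calc_load(load, exp, active, rounding)
        if new == load:
            return load
        load = new


# The single step worked through above, without and with rounding.
print(calc_load(1862, EXP_15, FIXED_1, rounding=False))   # 1862
print(calc_load(1862, EXP_15, FIXED_1, rounding=True))    # 1863

# Start idle and apply a constant 100% load until the average stops moving.
for rounding in (False, True):
    stuck = settle(0, EXP_15, FIXED_1, rounding)
    print(stuck, round(stuck / FIXED_1, 4))
# 1862 0.9092
# 1955 0.9546
```

Without rounding the average stalls about 9% short of the true load; with
rounding it stalls about 4.5% short.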

I have been running this proposed code with 100% load on CPU 7 for a couple
of hours now, and the 15 minute load average is stuck at 0.91.

Myself, I would not take out the rounding, but I defer to Peter.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=45001