Re: [PATCH RFC v5] cpufreq: schedutil: Make iowait boost more energy efficient

From: Joel Fernandes
Date: Mon Jul 17 2017 - 13:35:16 EST


Hi Viresh,

On Mon, Jul 17, 2017 at 1:04 AM, Viresh Kumar <viresh.kumar@xxxxxxxxxx> wrote:
> On 16-07-17, 01:04, Joel Fernandes wrote:
>> Currently the iowait_boost feature in schedutil makes the frequency go to max
>> on iowait wakeups. This feature was added to handle a case that Peter
>> described where the throughput of operations involving continuous I/O requests
>> [1] is reduced due to running at a lower frequency, however the lower
>> throughput itself causes utilization to be low and hence causing frequency to
>> be low hence its "stuck".
>>
>> Instead of going to max, its also possible to achieve the same effect by
>> ramping up to max if there are repeated in_iowait wakeups happening. This patch
>> is an attempt to do that. We start from a lower frequency (policy->mind)
>
> s/mind/min/
>
>> and double the boost for every consecutive iowait update until we reach the
>> maximum iowait boost frequency (iowait_boost_max).
>>
>> I ran a synthetic test (continuous O_DIRECT writes in a loop) on an x86 machine
>> with intel_pstate in passive mode using schedutil. In this test the iowait_boost
>> value ramped from 800MHz to 4GHz in 60ms. The patch achieves the desired improved
>> throughput as the existing behavior.
>>
>> Also while at it, make iowait_boost and iowait_boost_max as unsigned int since
>> its unit is kHz and this is consistent with struct cpufreq_policy.
>>
>> [1] https://patchwork.kernel.org/patch/9735885/
>>
>> Cc: Srinivas Pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx>
>> Cc: Len Brown <lenb@xxxxxxxxxx>
>> Cc: Rafael J. Wysocki <rjw@xxxxxxxxxxxxx>
>> Cc: Viresh Kumar <viresh.kumar@xxxxxxxxxx>
>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> Signed-off-by: Joel Fernandes <joelaf@xxxxxxxxxx>
>> ---
>> This version is based on some ideas from Viresh and Juri in v4. Viresh, one
>> difference between the idea we just discussed is, I am scaling up/down the
>> boost only after consuming it. This has the effect of slightly delaying the
>> "deboost" but achieves the same boost ramp time. Its more cleaner in the code
>> IMO to avoid the scaling up and then down on the initial boost. Note that I
>> also dropped iowait_boost_min and now I'm just starting the initial boost from
>> policy->min since as I mentioned in the commit above, the ramp of the
>> iowait_boost value is very quick and for the usecase its intended for, it works
>> fine. Hope this is acceptable. Thanks.
>>
>> kernel/sched/cpufreq_schedutil.c | 31 +++++++++++++++++++++++--------
>> 1 file changed, 23 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> index 622eed1b7658..4225bbada88d 100644
>> --- a/kernel/sched/cpufreq_schedutil.c
>> +++ b/kernel/sched/cpufreq_schedutil.c
>> @@ -53,8 +53,9 @@ struct sugov_cpu {
>> struct update_util_data update_util;
>> struct sugov_policy *sg_policy;
>>
>> - unsigned long iowait_boost;
>> - unsigned long iowait_boost_max;
>> + bool iowait_boost_pending;
>> + unsigned int iowait_boost;
>> + unsigned int iowait_boost_max;
>> u64 last_update;
>>
>> /* The fields below are only needed when sharing a policy. */
>> @@ -172,30 +173,43 @@ static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
>> unsigned int flags)
>> {
>> if (flags & SCHED_CPUFREQ_IOWAIT) {
>> - sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
>> + sg_cpu->iowait_boost_pending = true;
>> + sg_cpu->iowait_boost = max(sg_cpu->iowait_boost,
>> + sg_cpu->sg_policy->policy->min);
>> } else if (sg_cpu->iowait_boost) {
>> s64 delta_ns = time - sg_cpu->last_update;
>>
>> /* Clear iowait_boost if the CPU apprears to have been idle. */
>> - if (delta_ns > TICK_NSEC)
>> + if (delta_ns > TICK_NSEC) {
>> sg_cpu->iowait_boost = 0;
>> + sg_cpu->iowait_boost_pending = false;
>> + }
>
> We don't really need to clear this flag here as we are already making
> iowait_boost as 0 and that's what we check while using boost.

Hmm, I would rather clear this flag here and in the loop in
sugov_next_freq_shared since it keeps the flag in sync with what's
happening and is less confusing IMHO.

>
>> }
>> }
>>
>> static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, unsigned long *util,
>> unsigned long *max)
>> {
>> - unsigned long boost_util = sg_cpu->iowait_boost;
>> - unsigned long boost_max = sg_cpu->iowait_boost_max;
>> + unsigned long boost_util, boost_max;
>>
>> - if (!boost_util)
>> + if (!sg_cpu->iowait_boost)
>> return;
>>
>> + boost_util = sg_cpu->iowait_boost;
>> + boost_max = sg_cpu->iowait_boost_max;
>> +
>
> The above changes are not required anymore (and were required only
> with my patch).

Yep, I'll drop it.

>> if (*util * boost_max < *max * boost_util) {
>> *util = boost_util;
>> *max = boost_max;
>> }
>> - sg_cpu->iowait_boost >>= 1;
>> +
>> + if (sg_cpu->iowait_boost_pending) {
>> + sg_cpu->iowait_boost_pending = false;
>> + sg_cpu->iowait_boost = min(sg_cpu->iowait_boost << 1,
>> + sg_cpu->iowait_boost_max);
>
> Now this has a problem. We will also boost after waiting for
> rate_limit_us. And that's why I had proposed the tricky solution in

Not really unless rate_limit_us is < TICK_NSEC? Once TICK_NSEC
elapses, we would clear the boost in sugov_set_iowait_boost and in
sugov_next_freq_shared.

> the first place. I thought we wanted to avoid instant boost only for
> the first iteration, but after that we wanted to do it ASAP. Isn't it?
>
> Now that you are using policy->min instead of policy->cur, we can
> simplify the solution I proposed and always do 2 * iowait_boost before

No, doubling on the first boost was never discussed or intended in my
earlier patches. I thought even your patch never did, you were
dividing by 2, and then scaling it back up by 2 before consuming it to
preserve the initial boost.

> getting current util/max in above if loop. i.e. we will start iowait
> boost with min * 2 instead of min and that should be fine.

Hmm, but why start from double of min? Why not just min? It doesn't
make any difference to the intended behavior itself and is also
consistent with my proposal in RFC v4. Also I feel what you're
suggesting is more spike prone as well, the idea was to start from the
minimum and double it as we go, not to double the min the first go.
That was never intended.

Also I would rather keep the "set and use and set and use" pattern to
keep the logic less confusing and clean IMO.
So we set initial boost in sugov_set_iowait_boost, and then in
sugov_iowait_boost we use it, and then set the boost for the next time
around at the end of sugov_iowait_boost (that is we double it). Next
time sugov_set_iowait_boost wouldn't touch the boost whether iowait
flag is set or not and we would continue into sugov_iowait_boost to
consume the boost. This would have a small delay in reducing the
boost, but that's Ok since its only one cycle of delay, and keeps the
code clean. I assume the last part is not an issue considering you're
proposing double of the initial boost anyway ;-)

thanks,

-Joel