Re: sched/cpufreq: Rework schedutil governor performance estimation - Regression bisected

From: Vincent Guittot
Date: Fri Feb 16 2024 - 08:18:16 EST


Hi Doug,

On Thu, 15 Feb 2024 at 23:53, Doug Smythies <dsmythies@xxxxxxxxx> wrote:
>
> Hi Vincent,
>
> This email thread appears as if it might be moving away from a regression
> caused by your commit towards a conclusion that your commit exposed
> a pre-existing bug in the intel_psate.c code.

Ok

>
> Therefore, I have moved Rafael from the C.C. line to the "to" line and
> added Srinivas.
>
> On 2024.02.14 07:38 Vincent wrote:
> > On Tue, 13 Feb 2024 at 19:07, Doug Smythies <dsmythies@xxxxxxxxx> wrote:
> >> On 2024.02.13 03:27 Vincent wrote:
> >>> On Sun, 11 Feb 2024 at 17:43, Doug Smythies <dsmythies@xxxxxxxxx> wrote:
> >>>> On 2024.02.11 05:36 Vincent wrote:
> >>>>> On Sat, 10 Feb 2024 at 00:16, Doug Smythies <dsmythies@xxxxxxxxx> wrote:
> >>>>>> On 2024.02.09.14:11 Vincent wrote:
> >>>>>>> On Fri, 9 Feb 2024 at 22:38, Doug Smythies <dsmythies@xxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> I noticed a regression in the 6.8rc series kernels. Bisecting the kernel pointed to:
> >>>>>>>>
> >>>>>>>> # first bad commit: [9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d]
> >>>>>>>> sched/cpufreq: Rework schedutil governor performance estimation
> >>>>>>>>
> >>>>>>>> There was previous bisection and suggestion of reversion,
> >>>>>>>> but I guess it wasn't done in the end. [1]
> >>>>>>>
> >>>>>>> This has been fixed with
> >>>>>>> https://lore.kernel.org/all/170539970061.398.16662091173685476681.tip-bot2@tip-bot2/
> >>>>>>
> >>>>>> Okay, thanks. I didn't find that one.
> >>>>>>
> >>>>>>>> The regression: reduced maximum CPU frequency is ignored.
> >>>>
> >>>> Perhaps I should have said "sometimes ignored".
> >>>> With a maximum CPU frequency for all CPUs set to 2.4 GHz and
> >>>> a 100% load on CPU 5, its frequency was sampled 1000 times:
> >>>> 28.6% of samples were 2.4 GHz.
> >>>> 71.4% of samples were 4.8 GHz (the max turbo frequency)
> >>>> The results are highly non-repeatable, for example another sample:
> >>>> 32.8% of samples were 2.4 GHz.
> >>>> 76.2% of samples were 4.8 GHz
> >>>>
> >>>> Another interesting side note: If load is added to the other CPUs,
> >>>> the set maximum CPU frequency is enforced.
> >>>
> >>> Could you trace cpufreq and pstate ? I'd like to understand how
> >>> policy->cur can be changed
> >>> whereas there is this comment in intel_pstate_set_policy():
> >>> /*
> >>> * policy->cur is never updated with the intel_pstate driver, but it
> >>> * is used as a stale frequency value. So, keep it within limits.
> >>> */
> >>>
> >>> but cpufreq_driver_fast_switch() updates it with the freq returned by
> >>> intel_cpufreq_fast_switch()
> >>
> >> Perhaps I should submit a patch clarifying that comment.
> >> It is true for the "intel_pstate" CPU frequency scaling driver but not for the
> >> "intel_cpufreq" CPU frequency scaling driver, also known as the intel_pstate
> >> driver in passive mode. Sorry for any confusion.
> >>
> >> I ran the intel_pstate_tracer.py during the test and do observe many, but
> >> not all, CPUs requesting pstate 48 when the max is set to 24.
> >> The calling request seems to always be via "fast_switch" path.
> >> The root issue here appears to be a limit clamping problem for that path.
> >
> > Yes, I came to a similar conclusion as well. Whatever does schedutil
> > ask for, it should be clamped by cpu->max|min_perf_ratio.
>
> Agreed. And it is not clamping properly under specific conditions.
>
> > Do you know if you use fast_switch or adjust_perf call back ?
>
> I am not certain, but I think it uses "adjust_perf" call back.
> I do know for certain that it never takes the
> "intel_cpufreq_update_pstate" path
> and always takes the
> "intel_cpu_freq_adjust_perf" path.

intel_cpu_freq_adjust_perf is registered as the callback for
cpufreq->adjust_perf

>
> The problem seems to occur when that function is called with:
> min_perf = 1024
> target_perf = 1024
> capacity = 1024
>
> Even though cpu->max_perf_ratio is 24, the related HWP MSR,
> 0x774: IA32_HWP_REQUEST, ends up as 48, 48, 48 for min, max, des.
>
> This patch appears to fix the issue (still has my debug code and
> includes a question):
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index ca94e60e705a..8f88a04a494b 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -2987,12 +2987,22 @@ static void intel_cpufreq_adjust_perf(unsigned int cpunum,
> if (min_pstate < cpu->min_perf_ratio)
> min_pstate = cpu->min_perf_ratio;
>
> +// if (min_pstate > cpu->pstate.max_pstate) /* needed? I don't know */
> +// min_pstate = cpu->pstate.max_pstate;
> +
> + if (min_pstate > cpu->max_perf_ratio)
> + min_pstate = cpu->max_perf_ratio;
> +
> max_pstate = min(cap_pstate, cpu->max_perf_ratio);
> if (max_pstate < min_pstate)
> max_pstate = min_pstate;
>
> target_pstate = clamp_t(int, target_pstate, min_pstate, max_pstate);
>
> + if((max_pstate > 40) || (max_pstate < 7) || (min_pstate < 7) || min_pstate > 40 || target_pstate > 40){
> + pr_debug("Doug: t: %d : min %d : max %d : minp %d : maxp %d : mnperf %lu : tgperf %lu : capacity %lu\n", target_pstate, min_pstate, max_pstate, cpu->min_perf_ratio, cpu->max_perf_ratio, min_perf, target_perf, capacity);
> + }
> +
> intel_cpufreq_hwp_update(cpu, min_pstate, max_pstate, target_pstate, true);
>
> cpu->pstate.current_pstate = target_pstate;
>
> With the patch, I never hit the debug statement if the max CPU frequency is limited to 2.4 GHz,
> whereas it used to get triggered often.
> More importantly, the system seems to now behave properly and obey set CPU frequency limits.
>
>