Re: [GIT PULL] Scheduler changes for v6.8

From: Vincent Guittot
Date: Sun Jan 14 2024 - 06:18:21 EST


Hi Wyes,

Le dimanche 14 janv. 2024 à 14:42:40 (+0530), Wyes Karny a écrit :
> On Wed, Jan 10, 2024 at 02:57:14PM -0800, Linus Torvalds wrote:
> > On Wed, 10 Jan 2024 at 14:41, Linus Torvalds
> > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > It's one of these two:
> > >
> > > f12560779f9d sched/cpufreq: Rework iowait boost
> > > 9c0b4bb7f630 sched/cpufreq: Rework schedutil governor performance estimation
> > >
> > > one more boot to go, then I'll try to revert whichever causes my
> > > machine to perform horribly much worse.
> >
> > I guess it should come as no surprise that the result is
> >
> > 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d is the first bad commit
> >
> > but to revert cleanly I will have to revert all of
> >
> > b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> > f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> > performance estimation")
> >
> > This is on a 32-core (64-thread) AMD Ryzen Threadripper 3970X, fwiw.
> >
> > I'll keep that revert in my private test-tree for now (so that I have
> > a working machine again), but I'll move it to my main branch soon
> > unless somebody has a quick fix for this problem.
>
> Hi Linus,
>
> I'm able to reproduce this issue with my AMD Ryzen 5600G system. But
> only if I disable CPPC in BIOS and boot with acpi-cpufreq + schedutil.
> (I believe for your case also CPPC is diabled as log "_CPC object is not
> present" came). Enabling CPPC in BIOS issue not seen in my system. For
> AMD acpi-cpufreq also uses _CPC object to determine the boost ratio.
> When CPPC is disabled in BIOS something is going wrong and max
> capacity is becoming zero.
>
> Hi Vincent, Qais,
>
> I have collected some data with bpftracing:

Thanks for your tests results

>
> sudo bpftrace -e 'kretprobe:effective_cpu_util /cpu == 1/ { @eff_util = lhist(retval, 0, 1200, 50);} kprobe:get_next_freq /cpu == 1/ { @sugov_eff_util = lhist(arg1, 0, 1200, 50); @sugov_max_cap = lhist(arg2, 0, 1000, 2);} kretprobe:get_next_freq /cpu == 1/ { @sugov_freq = lhist(retval, 1000000, 5000000, 100000);}'
>
> with running: taskset -c 1 make
>
> issue case:
>
> Attaching 3 probes...
> @eff_util:
> [0, 50) 1263 |@ |
> [50, 100) 517 | |
> [100, 150) 233 | |
> [150, 200) 297 | |
> [200, 250) 162 | |
> [250, 300) 98 | |
> [300, 350) 75 | |
> [350, 400) 205 | |
> [400, 450) 210 | |
> [450, 500) 16 | |
> [500, 550) 1532 |@ |
> [550, 600) 1026 | |
> [600, 650) 761 | |
> [650, 700) 876 | |
> [700, 750) 1085 | |
> [750, 800) 891 | |
> [800, 850) 816 | |
> [850, 900) 983 | |
> [900, 950) 661 | |
> [950, 1000) 759 | |
> [1000, 1050) 57433 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>

ok so the output of effective_cpu_util() seems correct or at least to maw utilization
value. In order to be correct, it means that arch_scale_cpu_capacity(cpu) is not zero
because of :

unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
unsigned long *min,
unsigned long *max)
{
unsigned long util, irq, scale;
struct rq *rq = cpu_rq(cpu);

scale = arch_scale_cpu_capacity(cpu);

/*
* Early check to see if IRQ/steal time saturates the CPU, can be
* because of inaccuracies in how we track these -- see
* update_irq_load_avg().
*/
irq = cpu_util_irq(rq);
if (unlikely(irq >= scale)) {
if (min)
*min = scale;
if (max)
*max = scale;
return scale;
}
..
}

If arch_scale_cpu_capacity(cpu) returns 0 then effective_cpu_util() should returns
0 too.

Now see below

> @sugov_eff_util:
> [0, 50) 1074 | |
> [50, 100) 571 | |
> [100, 150) 259 | |
> [150, 200) 169 | |
> [200, 250) 237 | |
> [250, 300) 156 | |
> [300, 350) 91 | |
> [350, 400) 46 | |
> [400, 450) 52 | |
> [450, 500) 195 | |
> [500, 550) 175 | |
> [550, 600) 46 | |
> [600, 650) 493 | |
> [650, 700) 1424 |@ |
> [700, 750) 646 | |
> [750, 800) 628 | |
> [800, 850) 612 | |
> [850, 900) 840 | |
> [900, 950) 893 | |
> [950, 1000) 640 | |
> [1000, 1050) 60679 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> @sugov_freq:
> [1400000, 1500000) 69911 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> @sugov_max_cap:
> [0, 2) 69926 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

In get_next_freq(struct sugov_policy *sg_policy, unsigned long util, unsigned long max)

max is 0 and we comes from this path:

static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
unsigned int flags)
{

..
max_cap = arch_scale_cpu_capacity(sg_cpu->cpu);

if (!sugov_update_single_common(sg_cpu, time, max_cap, flags))
return;

next_f = get_next_freq(sg_policy, sg_cpu->util, max_cap);
..

so here arch_scale_cpu_capacity(sg_cpu->cpu) returns 0 ...

AFAICT, AMD platform uses the default
static __always_inline
unsigned long arch_scale_cpu_capacity(int cpu)
{
return SCHED_CAPACITY_SCALE;
}

I'm missing something here

>
>
> good case:
>
> Attaching 3 probes...
> @eff_util:
> [0, 50) 246 |@ |
> [50, 100) 150 |@ |
> [100, 150) 191 |@ |
> [150, 200) 239 |@ |
> [200, 250) 117 | |
> [250, 300) 2101 |@@@@@@@@@@@@@@@ |
> [300, 350) 2284 |@@@@@@@@@@@@@@@@ |
> [350, 400) 713 |@@@@@ |
> [400, 450) 151 |@ |
> [450, 500) 154 |@ |
> [500, 550) 1121 |@@@@@@@@ |
> [550, 600) 1901 |@@@@@@@@@@@@@ |
> [600, 650) 1208 |@@@@@@@@ |
> [650, 700) 606 |@@@@ |
> [700, 750) 557 |@@@ |
> [750, 800) 872 |@@@@@@ |
> [800, 850) 1092 |@@@@@@@ |
> [850, 900) 1416 |@@@@@@@@@@ |
> [900, 950) 1107 |@@@@@@@ |
> [950, 1000) 1051 |@@@@@@@ |
> [1000, 1050) 7260 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> @sugov_eff_util:
> [0, 50) 241 | |
> [50, 100) 149 | |
> [100, 150) 72 | |
> [150, 200) 95 | |
> [200, 250) 43 | |
> [250, 300) 49 | |
> [300, 350) 19 | |
> [350, 400) 56 | |
> [400, 450) 22 | |
> [450, 500) 29 | |
> [500, 550) 1840 |@@@@@@ |
> [550, 600) 1476 |@@@@@ |
> [600, 650) 1027 |@@@ |
> [650, 700) 473 |@ |
> [700, 750) 366 |@ |
> [750, 800) 627 |@@ |
> [800, 850) 930 |@@@ |
> [850, 900) 1285 |@@@@ |
> [900, 950) 971 |@@@ |
> [950, 1000) 946 |@@@ |
> [1000, 1050) 13839 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> @sugov_freq:
> [1400000, 1500000) 648 |@ |
> [1500000, 1600000) 0 | |
> [1600000, 1700000) 0 | |
> [1700000, 1800000) 25 | |
> [1800000, 1900000) 0 | |
> [1900000, 2000000) 0 | |
> [2000000, 2100000) 0 | |
> [2100000, 2200000) 0 | |
> [2200000, 2300000) 0 | |
> [2300000, 2400000) 0 | |
> [2400000, 2500000) 0 | |
> [2500000, 2600000) 0 | |
> [2600000, 2700000) 0 | |
> [2700000, 2800000) 0 | |
> [2800000, 2900000) 0 | |
> [2900000, 3000000) 0 | |
> [3000000, 3100000) 0 | |
> [3100000, 3125K) 0 | |
> [3125K, 3300000) 0 | |
> [3300000, 3400000) 0 | |
> [3400000, 3500000) 0 | |
> [3500000, 3600000) 0 | |
> [3600000, 3700000) 0 | |
> [3700000, 3800000) 0 | |
> [3800000, 3900000) 0 | |
> [3900000, 4000000) 23879 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> @sugov_max_cap:
> [0, 2) 24555 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> In both case max_cap is zero but selected freq is incorrect in bad case.

Also we have in get_next_freq():
freq = map_util_freq(util, freq, max);
--> util * freq /max

If max was 0, we should have been an error ?

There is something strange that I don't understand

Could you trace on the return of sugov_get_util()
the value of sg_cpu->util ?

Thanks for you help
Vincent

>
> Thanks,
> Wyes
>