Re: [PATCH] Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit

From: Ingo Molnar
Date: Fri Jan 12 2024 - 13:24:41 EST



* Ingo Molnar <mingo@xxxxxxxxxx> wrote:

> > I can provide a clean revert of only:
> > f12560779f9d ("sched/cpufreq: Rework iowait boost")
> > 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
>
> I've done this too, see this new commit in sched/urgent:
>
> 60ee1706bd11 ("Revert "sched/cpufreq: Rework schedutil governor performance estimation" and dependent commit")
>
> Also attached below.
>
> > if the fix that I proposed doesn't work:
> > https://lore.kernel.org/all/ZZ+ixagkxRPYyTCE@vingu-book/
>
> Yeah - although of course Linus is free to just pull the revert as well.
> I'll try to reproduce the regression locally as well.

Update & heads up: unfortunately I'm unable to reproduce the regression on
a fairly similar system with a Threadripper 3970X CPU.

Kernel build times are very close, with or without the revert, on vanilla
v6.7 or v6.7+sched/core.

Here are a few results where I tried to quantify kernel build times without
having to wait a long time.

Re-building the kernel/**.o object files in a loop:

$ perf stat --pre 'rm -f kernel/*.o kernel/*/*.o kernel/*/*/*.o' --null --sync --repeat 3 make -j64 kernel/ >/dev/null


# v6.7.0:
# bootup default schedutil governor:
24.521 +- 0.077 seconds time elapsed ( +- 0.31% )
24.644 +- 0.071 seconds time elapsed ( +- 0.29% )

# cpufreq-max:
24.452 +- 0.110 seconds time elapsed ( +- 0.45% )
24.482 +- 0.048 seconds time elapsed ( +- 0.20% )

# v6.7.0+sched/core:
# bootup default schedutil governor:
24.666 +- 0.063 seconds time elapsed ( +- 0.26% )
24.809 +- 0.118 seconds time elapsed ( +- 0.48% )

The fully-cached build numbers are very close to each other, and during the
hot phase of the kernel build all CPUs are saturated.
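To put a number on "very close": a quick back-of-the-envelope comparison of
the mean elapsed times above (a throwaway Python sketch, using the figures
from the perf stat runs verbatim):

```python
# Mean elapsed times (seconds), copied from the perf stat runs above.
v67_schedutil = (24.521 + 24.644) / 2   # v6.7.0, bootup default schedutil
v67_cpufreq_max = (24.452 + 24.482) / 2 # v6.7.0, cpufreq-max
sched_core = (24.666 + 24.809) / 2      # v6.7.0+sched/core, schedutil

def delta_pct(base, new):
    """Relative difference of 'new' vs 'base', in percent."""
    return (new - base) / base * 100.0

# sched/core vs vanilla v6.7 with schedutil: ~0.6%, which is within a
# couple of standard deviations of the per-run noise quoted by perf.
print(f"sched/core vs v6.7: {delta_pct(v67_schedutil, sched_core):+.2f}%")
print(f"cpufreq-max vs schedutil: {delta_pct(v67_schedutil, v67_cpufreq_max):+.2f}%")
```

So nothing remotely close to the 2x regression Linus reported.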

The 2x performance regression that Linus is seeing is either due to some
pathological wakeup behavior, or perhaps the cores aren't transitioning
frequencies? The difference between the lowest and highest frequency is
pretty substantial (at least on my box):

cpu MHz : 2200.000
...
cpu MHz : 4000.000
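For anyone wanting to rule frequency transitions in or out on their own box,
the governor and frequency limits can be inspected directly - a sketch
assuming the standard sysfs cpufreq interface (paths can differ per driver,
hence the guard):

```shell
# Sketch: show cpu0's current governor and its min/max scaling
# frequencies via the standard cpufreq sysfs interface (assumed
# present; not all drivers/configs expose it).
cpu0=/sys/devices/system/cpu/cpu0/cpufreq
if [ -d "$cpu0" ]; then
    cat "$cpu0/scaling_governor" \
        "$cpu0/scaling_min_freq" \
        "$cpu0/scaling_max_freq"
else
    echo "no cpufreq sysfs interface on this machine"
fi
```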


There was *one* test, when the tree was cache-cold, where I saw really bad
performance (which I didn't really expect with my NVMe system), with -j32
builds:

Performance counter stats for 'make -j32 kernel/' (3 runs):

64.34 +- 39.22 seconds time elapsed ( +- 60.95% )
25.08 +- 0.142 seconds time elapsed ( +- 0.56% )
24.97 +- 0.072 seconds time elapsed ( +- 0.29% )

Unfortunately that outlier was on a vanilla v6.7 bootup.

As a next step I could try Linus's specific config, maybe there's some
detail in it that makes the difference.

The commit itself that Linus bisected to (9c0b4bb7f6303c) doesn't *seem*
wrong in itself, especially without uclamp [I presume Linus doesn't use
CONFIG_UCLAMP_TASK=y and the cpu.uclamp.min/uclamp.max cgroup interface
that goes with it?], but the commit changes how we use sched_util metrics,
which could change scheduling patterns - which is why I've spent many
hours yesterday and today trying to find a pathological workload that
reproduces this. No luck so far.

Linus: I can send a pull request for the 2-commit revert, or maybe you
could try Vincent's guess-patch that tries to restore the previous
behavior as closely as possible.

Thanks,

Ingo