Re: [PATCH] sched/fair: Do not wakeup-preempt same-prio SCHED_OTHER tasks

From: Ingo Molnar
Date: Mon Sep 25 2023 - 07:07:18 EST



* kernel test robot <oliver.sang@xxxxxxxxx> wrote:

> Hello,
>
> kernel test robot noticed a -19.0% regression of stress-ng.filename.ops_per_sec on:

Thanks for the testing, this is useful!

So I've tabulated the results into a much easier to read format:

> | testcase: change | stress-ng: stress-ng.filename.ops_per_sec -19.0% regression
> | testcase: change | stress-ng: stress-ng.lockbus.ops_per_sec -6.0% regression
> | testcase: change | stress-ng: stress-ng.sigfd.ops_per_sec 17.6% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Masskrug.CPU-only.seconds -5.3% improvement
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec 11.5% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Boat.CPU-only.seconds -3.5% improvement
> | testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec 100.2% improvement
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -93.9% regression
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -82.1% regression
> | testcase: change | stress-ng: stress-ng.sock.ops_per_sec 59.4% improvement
> | testcase: change | blogbench: blogbench.write_score -35.9% regression
> | testcase: change | hackbench: hackbench.throughput -4.8% regression
> | testcase: change | blogbench: blogbench.write_score -59.3% regression
> | testcase: change | stress-ng: stress-ng.exec.ops_per_sec -34.6% regression
> | testcase: change | netperf: netperf.Throughput_Mbps 60.6% improvement
> | testcase: change | hackbench: hackbench.throughput 19.1% improvement
> | testcase: change | stress-ng: stress-ng.dnotify.ops_per_sec -15.7% regression

And then sorted them along the regression/improvement axis:

> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -93.9% regression
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -82.1% regression
> | testcase: change | blogbench: blogbench.write_score -59.3% regression
> | testcase: change | blogbench: blogbench.write_score -35.9% regression
> | testcase: change | stress-ng: stress-ng.exec.ops_per_sec -34.6% regression
> | testcase: change | stress-ng: stress-ng.filename.ops_per_sec -19.0% regression
> | testcase: change | stress-ng: stress-ng.dnotify.ops_per_sec -15.7% regression
> | testcase: change | stress-ng: stress-ng.lockbus.ops_per_sec -6.0% regression
> | testcase: change | hackbench: hackbench.throughput -4.8% regression
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Masskrug.CPU-only.seconds +5.3% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Boat.CPU-only.seconds +3.5% improvement
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec 11.5% improvement
> | testcase: change | stress-ng: stress-ng.sigfd.ops_per_sec 17.6% improvement
> | testcase: change | hackbench: hackbench.throughput 19.1% improvement
> | testcase: change | stress-ng: stress-ng.sock.ops_per_sec 59.4% improvement
> | testcase: change | netperf: netperf.Throughput_Mbps 60.6% improvement
> | testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec 100.2% improvement

Testing results notes:

- the '+' denotes an improvement whose sign I inverted: the two darktable results are
measured in seconds, where lower is better. The mixing of signs in the output of the
kernel test robot is arguably confusing.

- Any hope of getting a similar summary format by default? It's much more informative than
just picking out the biggest regression, which wasn't even done correctly AFAICT.
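
For reference, the sign normalization behind the sorted list can be sketched in a
few lines of C. This is only an illustration: the struct and function names are
made up here, and treating every metric whose name ends in ".seconds" as
lower-is-better is an assumption, not something the robot's output states.

```c
#include <stdlib.h>
#include <string.h>

/* One summary line from the report (hypothetical representation). */
struct result {
	const char *metric;
	double change_pct;	/* raw change as printed by the robot */
};

/* Assumption: metrics ending in ".seconds" are lower-is-better. */
int lower_is_better(const char *metric)
{
	const char *suffix = ".seconds";
	size_t n = strlen(metric), m = strlen(suffix);

	return n >= m && strcmp(metric + n - m, suffix) == 0;
}

/* Flip the sign for lower-is-better metrics, so positive always means "better". */
double normalized_change(const struct result *r)
{
	return lower_is_better(r->metric) ? -r->change_pct : r->change_pct;
}

/* qsort() comparator: worst regression first, biggest improvement last. */
int cmp_result(const void *a, const void *b)
{
	double x = normalized_change(a), y = normalized_change(b);

	return (x > y) - (x < y);
}

void sort_results(struct result *r, size_t n)
{
	qsort(r, n, sizeof(*r), cmp_result);
}
```

Calling sort_results() on the parsed summary lines gives a strict ordering by
normalized change, with every regression ahead of every improvement.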

Summary:

While there are a lot of improvements, it is primarily the nature of the performance
regressions that dictates the way forward:

- stress-ng.sigsuspend.ops_per_sec regressions, -93%:

Clearly signal delivery performance hurts from delayed preemption, but
that should be straightforward to resolve, if we are willing to commit
to adding a high-prio insta-wakeup variant API ...

- stress-ng.exec.ops_per_sec -34% regression:

Likewise this possibly expresses that it's better to immediately reschedule
during exec() - but maybe there's more to it and it reflects some unfavorable
migration, as suggested by the NUMA locality figures:

              %change            %stddev
                 |                 \
    79317172   -34.2%    52217838 ± 3%  numa-numastat.node0.local_node
    79360983   -34.2%    52240348 ± 3%  numa-numastat.node0.numa_hit
    77971050   -33.2%    52068168 ± 3%  numa-numastat.node1.local_node
    78009071   -33.2%    52089987 ± 3%  numa-numastat.node1.numa_hit
       88287   -45.7%       47970 ± 2%  vmstat.system.cs

- 'blogbench' regression of -59%:

It too has a very large reduction in context switches:

            %stddev %change            %stddev
              \        |                 \
       30035         -49.7%       15097 ± 3%  vmstat.system.cs
     2243545 ± 2%     -4.1%     2152228       blogbench.read_score
    52412617         -28.3%    37571769       blogbench.time.file_system_outputs
     2682930         -74.1%      694136       blogbench.time.involuntary_context_switches
     2369329         -50.0%     1184098 ± 5%  blogbench.time.voluntary_context_switches
        5851         -35.9%        3752 ± 2%  blogbench.write_score

It's unclear to me what's happening with this one, just from these stats,
but it's "write_score" that hurts most.

- 'stress-ng.filename.ops_per_sec' regression of -19%:

This test suffered from an *increase* in context-switching, and a large
increase in CPU-idle:

            %stddev %change            %stddev
              \        |                 \
     4641666         +19.5%     5545394 ± 2%  cpuidle..usage
       90589 ± 2%    +70.5%      154471 ± 2%  vmstat.system.cs
      628439         -19.2%      507711       stress-ng.filename.ops
       10317         -19.0%        8355       stress-ng.filename.ops_per_sec

      171981         -59.7%       69333 ± 3%  stress-ng.time.involuntary_context_switches
      770691 ± 3%   +200.9%     2319214       stress-ng.time.voluntary_context_switches

Anyway, it's clear from these results that while many workloads are hurt
by our notion of wake-preemption, there are several that benefit
from it, especially generic ones like phoronix-test-suite - which have
no good way to turn off wakeup preemption (SCHED_BATCH might help though).
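
For anyone who wants to try the SCHED_BATCH idea on a benchmark, a minimal sketch
of opting the calling thread into that policy follows. sched_setscheduler() and
SCHED_BATCH are the standard Linux interfaces; the helper name switch_to_batch()
is made up for this example, and whether the policy actually helps any of the
workloads above is untested here.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Fallback in case the libc headers were included without _GNU_SOURCE. */
#ifndef SCHED_BATCH
#define SCHED_BATCH 3	/* Linux ABI value */
#endif

/*
 * Switch the calling thread to SCHED_BATCH.
 * Returns 0 on success, -1 on failure with errno set.
 */
int switch_to_batch(void)
{
	/* sched_priority must be 0 for the non-realtime policies. */
	struct sched_param param = { .sched_priority = 0 };

	/* pid 0 means "the calling thread". */
	return sched_setscheduler(0, SCHED_BATCH, &param);
}
```

Since the policy is inherited across fork() and exec() by default, calling this
early in a benchmark launcher should cover the child processes too.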

One way to approach this: instead of always doing wakeup-preemption
(our current default), we could turn it around and only use it when
it is clearly beneficial - such as signal delivery, or exec().

The canonical way to solve this would be to give *userspace* a way to
signal that it's beneficial to preempt immediately, i.e. yield(),
but right now that interface is hurting tasks that only want to
give other tasks a chance to run, without necessarily giving up
their own right to run:

se->deadline += calc_delta_fair(se->slice, se);
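
To make the cost of that line concrete, here is a toy userspace model of it. This
is a simplified sketch, not the kernel code: the struct, the toy_* names and the
plain integer scaling are assumptions made for illustration, and the real
calc_delta_fair() uses fixed-point inverse weights. The only point it carries over
is that the slice is scaled by the nice-0 weight over the entity's weight.

```c
#include <stdint.h>

#define TOY_NICE_0_WEIGHT 1024ULL	/* weight of a nice-0 task in this model */

/* Hypothetical, stripped-down stand-in for struct sched_entity. */
struct toy_entity {
	uint64_t weight;	/* load weight; 1024 == nice 0 */
	uint64_t slice;		/* requested slice, in ns */
	uint64_t deadline;	/* virtual deadline, in ns of virtual time */
};

/* Scale a wall-clock delta into virtual time: heavier tasks advance slower. */
uint64_t toy_calc_delta_fair(uint64_t delta, const struct toy_entity *se)
{
	if (se->weight != TOY_NICE_0_WEIGHT)
		delta = delta * TOY_NICE_0_WEIGHT / se->weight;
	return delta;
}

/* Model of the yield path: push the deadline out by one scaled slice. */
void toy_yield(struct toy_entity *se)
{
	se->deadline += toy_calc_delta_fair(se->slice, se);
}
```

In this model a nice-0 task with a 3 ms slice moves its deadline a full 3 ms of
virtual time into the future on every yield, so it falls behind every task with
an earlier deadline instead of merely letting one other task run.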

Anyway, my patch is obviously a no-go as-is, and this clearly needs more work.

Thanks,

Ingo