Re: [PATCH 00/17] sched: EEVDF using latency-nice

From: Phil Auld
Date: Tue Apr 25 2023 - 08:33:19 EST



Hi Peter,

On Tue, Mar 28, 2023 at 11:26:22AM +0200 Peter Zijlstra wrote:
> Hi!
>
> Latest version of the EEVDF [1] patches.
>
> Many changes since last time; most notably it now fully replaces CFS and uses
> lag based placement for migrations. Smaller changes include:
>
> - uses scale_load_down() for avg_vruntime; I measured the max delta to be ~44
> bits on a system/cgroup based kernel build.
> - fixed a bunch of reweight / cgroup placement issues
> - adaptive placement strategy for smaller slices
> - rename se->lag to se->vlag
>
> There's a bunch of RFC patches at the end and one DEBUG patch. Of those, the
> PLACE_BONUS patch is a mixed bag of pain. A number of benchmarks regress
> because EEVDF is actually fair and gives a 100% parent vs a 50% child a 67%/33%
> split (stress-futex, stress-nanosleep, starve, etc..) instead of a 50%/50%
> split that sleeper bonus achieves. Mostly I think these benchmarks are somewhat
> artificial/daft but who knows.
>
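
For anyone following along, here is my reading of where the 67%/33% comes
from. This is a back-of-the-envelope sketch, not what the scheduler actually
does. The assumptions are mine and not taken from the patches: two
equal-weight tasks on one CPU, ideal 50/50 sharing whenever both are runnable,
a child that needs a fixed burst of CPU time and then sleeps for the same
amount of wall-clock time, and no credit carried across the sleep. The
simulate() helper and its parameters are hypothetical names for illustration.

```python
from fractions import Fraction

def simulate(burst=10, sleep=10, ticks=30000):
    """Toy model: return (parent_share, child_share) of one CPU.

    Assumptions (illustrative only): the parent is always runnable, the
    child needs `burst` ticks of CPU and then sleeps `sleep` wall-clock
    ticks, and the CPU is split evenly while both are runnable.
    """
    parent_cpu = Fraction(0)
    child_cpu = Fraction(0)
    remaining = Fraction(burst)  # CPU time the child still needs this burst
    sleeping = 0                 # wall-clock ticks left in the child's sleep

    for _ in range(ticks):
        if sleeping > 0:
            # Child is asleep: the parent gets the whole tick.
            parent_cpu += 1
            sleeping -= 1
        else:
            # Both runnable: each gets half of the tick.
            parent_cpu += Fraction(1, 2)
            child_cpu += Fraction(1, 2)
            remaining -= Fraction(1, 2)
            if remaining == 0:
                # Burst finished: go to sleep, then start a new burst.
                sleeping = sleep
                remaining = Fraction(burst)

    return parent_cpu / ticks, child_cpu / ticks
```

With the defaults the child takes 20 ticks of wall-clock time to get its 10
ticks of CPU and then sleeps for 10, so each 30-tick period splits 20/10 and
simulate() returns (2/3, 1/3). A sleeper bonus that lets the child catch up
after waking is what would push this back toward 50%/50%.
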
> The PLACE_BONUS thing horribly messes up things like hackbench and latency-nice
> because it places things too far to the left in the tree. Basically it messes
> with the whole 'when', by placing a task back in history you're putting a
> burden on the now to accommodate catching up. More tinkering required.
>
> But over-all the thing seems to be fairly usable and could do with more
> extensive testing.

I had Jirka run his suite of perf workloads on this. These are macro benchmarks
on bare metal (NAS, SPECjbb, etc.). I can't share specific results because they
come out as HTML reports on an internal website. There was no noticeable change
in overall performance compared to CFS, which is a good thing.

There was a win in stability though. A number of the error boxes across the
board were smaller, so there is less variance.

These are mostly performance/throughput tests. We're going to run some more
latency-sensitive tests now.

So, fwiw, EEVDF is performing well on macro workloads here.



Cheers,
Phil

>
> [1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564
>
> Results:
>
> hackbench -g $nr_cpu + cyclictest --policy other results:
>
>                               EEVDF   CFS
>
>              # Min Latencies: 00054
>   LNICE(19)  # Avg Latencies: 00660
>              # Max Latencies: 23103
>
>              # Min Latencies: 00052   00053
>   LNICE(0)   # Avg Latencies: 00318   00687
>              # Max Latencies: 08593   13913
>
>              # Min Latencies: 00054
>   LNICE(-19) # Avg Latencies: 00055
>              # Max Latencies: 00061
>
>
> Some preliminary results from Chen Yu on a slightly older version:
>
> schbench (95% tail latency, lower is better)
> =================================================================================
> case nr_instance baseline (std%) compare% ( std%)
> normal 25% 1.00 (2.49%) -81.2% (4.27%)
> normal 50% 1.00 (2.47%) -84.5% (0.47%)
> normal 75% 1.00 (2.5%) -81.3% (1.27%)
> normal 100% 1.00 (3.14%) -79.2% (0.72%)
> normal 125% 1.00 (3.07%) -77.5% (0.85%)
> normal 150% 1.00 (3.35%) -76.4% (0.10%)
> normal 175% 1.00 (3.06%) -76.2% (0.56%)
> normal 200% 1.00 (3.11%) -76.3% (0.39%)
> ==================================================================================
>
> hackbench (throughput, higher is better)
> ==============================================================================
> case nr_instance baseline(std%) compare%( std%)
> threads-pipe 25% 1.00 (<2%) -17.5 (<2%)
> threads-socket 25% 1.00 (<2%) -1.9 (<2%)
> threads-pipe 50% 1.00 (<2%) +6.7 (<2%)
> threads-socket 50% 1.00 (<2%) -6.3 (<2%)
> threads-pipe 100% 1.00 (3%) +110.1 (3%)
> threads-socket 100% 1.00 (<2%) -40.2 (<2%)
> threads-pipe 150% 1.00 (<2%) +125.4 (<2%)
> threads-socket 150% 1.00 (<2%) -24.7 (<2%)
> threads-pipe 200% 1.00 (<2%) -89.5 (<2%)
> threads-socket 200% 1.00 (<2%) -27.4 (<2%)
> process-pipe 25% 1.00 (<2%) -15.0 (<2%)
> process-socket 25% 1.00 (<2%) -3.9 (<2%)
> process-pipe 50% 1.00 (<2%) -0.4 (<2%)
> process-socket 50% 1.00 (<2%) -5.3 (<2%)
> process-pipe 100% 1.00 (<2%) +62.0 (<2%)
> process-socket 100% 1.00 (<2%) -39.5 (<2%)
> process-pipe 150% 1.00 (<2%) +70.0 (<2%)
> process-socket 150% 1.00 (<2%) -20.3 (<2%)
> process-pipe 200% 1.00 (<2%) +79.2 (<2%)
> process-socket 200% 1.00 (<2%) -22.4 (<2%)
> ==============================================================================
>
> stress-ng (throughput, higher is better)
> ==============================================================================
> case nr_instance baseline(std%) compare%( std%)
> switch 25% 1.00 (<2%) -6.5 (<2%)
> switch 50% 1.00 (<2%) -9.2 (<2%)
> switch 75% 1.00 (<2%) -1.2 (<2%)
> switch 100% 1.00 (<2%) +11.1 (<2%)
> switch 125% 1.00 (<2%) -16.7% (9%)
> switch 150% 1.00 (<2%) -13.6 (<2%)
> switch 175% 1.00 (<2%) -16.2 (<2%)
> switch 200% 1.00 (<2%) -19.4% (<2%)
> fork 50% 1.00 (<2%) -0.1 (<2%)
> fork 75% 1.00 (<2%) -0.3 (<2%)
> fork 100% 1.00 (<2%) -0.1 (<2%)
> fork 125% 1.00 (<2%) -6.9 (<2%)
> fork 150% 1.00 (<2%) -8.8 (<2%)
> fork 200% 1.00 (<2%) -3.3 (<2%)
> futex 25% 1.00 (<2%) -3.2 (<2%)
> futex 50% 1.00 (3%) -19.9 (5%)
> futex 75% 1.00 (6%) -19.1 (2%)
> futex 100% 1.00 (16%) -30.5 (10%)
> futex 125% 1.00 (25%) -39.3 (11%)
> futex 150% 1.00 (20%) -27.2% (17%)
> futex 175% 1.00 (<2%) -18.6 (<2%)
> futex 200% 1.00 (<2%) -47.5 (<2%)
> nanosleep 25% 1.00 (<2%) -0.1 (<2%)
> nanosleep 50% 1.00 (<2%) -0.0% (<2%)
> nanosleep 75% 1.00 (<2%) +15.2% (<2%)
> nanosleep 100% 1.00 (<2%) -26.4 (<2%)
> nanosleep 125% 1.00 (<2%) -1.3 (<2%)
> nanosleep 150% 1.00 (<2%) +2.1 (<2%)
> nanosleep 175% 1.00 (<2%) +8.3 (<2%)
> nanosleep 200% 1.00 (<2%) +2.0% (<2%)
> ===============================================================================
>
> unixbench (throughput, higher is better)
> ==============================================================================
> case nr_instance baseline(std%) compare%( std%)
> spawn 125% 1.00 (<2%) +8.1 (<2%)
> context1 100% 1.00 (6%) +17.4 (6%)
> context1 75% 1.00 (13%) +18.8 (8%)
> =================================================================================
>
> netperf (throughput, higher is better)
> ===========================================================================
> case nr_instance baseline(std%) compare%( std%)
> UDP_RR 25% 1.00 (<2%) -1.5% (<2%)
> UDP_RR 50% 1.00 (<2%) -0.3% (<2%)
> UDP_RR 75% 1.00 (<2%) +12.5% (<2%)
> UDP_RR 100% 1.00 (<2%) -4.3% (<2%)
> UDP_RR 125% 1.00 (<2%) -4.9% (<2%)
> UDP_RR 150% 1.00 (<2%) -4.7% (<2%)
> UDP_RR 175% 1.00 (<2%) -6.1% (<2%)
> UDP_RR 200% 1.00 (<2%) -6.6% (<2%)
> TCP_RR 25% 1.00 (<2%) -1.4% (<2%)
> TCP_RR 50% 1.00 (<2%) -0.2% (<2%)
> TCP_RR 75% 1.00 (<2%) -3.9% (<2%)
> TCP_RR 100% 1.00 (2%) +3.6% (5%)
> TCP_RR 125% 1.00 (<2%) -4.2% (<2%)
> TCP_RR 150% 1.00 (<2%) -6.0% (<2%)
> TCP_RR 175% 1.00 (<2%) -7.4% (<2%)
> TCP_RR 200% 1.00 (<2%) -8.4% (<2%)
> ==========================================================================
>
>
> ---
> Also available at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/eevdf
>
> ---
> Parth Shah (1):
> sched: Introduce latency-nice as a per-task attribute
>
> Peter Zijlstra (14):
> sched/fair: Add avg_vruntime
> sched/fair: Remove START_DEBIT
> sched/fair: Add lag based placement
> rbtree: Add rb_add_augmented_cached() helper
> sched/fair: Implement an EEVDF like policy
> sched: Commit to lag based placement
> sched/smp: Use lag to simplify cross-runqueue placement
> sched: Commit to EEVDF
> sched/debug: Rename min_granularity to base_slice
> sched: Merge latency_offset into slice
> sched/eevdf: Better handle mixed slice length
> sched/eevdf: Sleeper bonus
> sched/eevdf: Minimal vavg option
> sched/eevdf: Debug / validation crud
>
> Vincent Guittot (2):
> sched/fair: Add latency_offset
> sched/fair: Add sched group latency support
>
> Documentation/admin-guide/cgroup-v2.rst | 10 +
> include/linux/rbtree_augmented.h | 26 +
> include/linux/sched.h | 6 +
> include/uapi/linux/sched.h | 4 +-
> include/uapi/linux/sched/types.h | 19 +
> init/init_task.c | 3 +-
> kernel/sched/core.c | 65 +-
> kernel/sched/debug.c | 49 +-
> kernel/sched/fair.c | 1199 ++++++++++++++++---------------
> kernel/sched/features.h | 29 +-
> kernel/sched/sched.h | 23 +-
> tools/include/uapi/linux/sched.h | 4 +-
> 12 files changed, 794 insertions(+), 643 deletions(-)
>
