schbench v1.0

From: Chris Mason
Date: Mon Apr 17 2023 - 04:10:55 EST


Hi everyone,

Since we've been doing a lot of scheduler benchmarking lately, I wanted
to dust off schbench and see if I could make it more accurately model
the results we're seeing from production workloads.

I've reworked a few things, and since it's somewhat different now, I went
ahead and tagged v1.0:

https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git

I also tossed in a README.md, which documents the arguments.

https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/tree/README.md

The original schbench focused almost entirely on wakeup latencies, which
are still included in the output. Instead of spinning for a fixed
amount of wall time, v1.0 now uses a loop of matrix multiplication to
simulate a web request.
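
As a rough sketch of what that per-request CPU work looks like (the
matrix size and loop count below are placeholders, not schbench's
actual defaults):

/*
 * Sketch of the fixed-CPU-work request body: a small dense matrix
 * multiply repeated a fixed number of times.  Illustrative only.
 */
#define MAT_SIZE 64

static double a[MAT_SIZE][MAT_SIZE];
static double b[MAT_SIZE][MAT_SIZE];
static double c[MAT_SIZE][MAT_SIZE];

static void do_matrix_work(unsigned long loops)
{
	for (unsigned long n = 0; n < loops; n++) {
		for (int i = 0; i < MAT_SIZE; i++) {
			for (int j = 0; j < MAT_SIZE; j++) {
				double sum = 0;

				for (int k = 0; k < MAT_SIZE; k++)
					sum += a[i][k] * b[k][j];
				c[i][j] = sum;
			}
		}
	}
}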

David Vernet recently benchmarked EEVDF, CFS, and sched_ext against
production workloads:

https://lore.kernel.org/lkml/20230411020945.GA65214@maniforge/

What we see in general is that involuntary context switches trigger a
basket of expensive interactions between CPU, memory, and disk. This is
pretty difficult to model from a benchmark targeting just the scheduler,
so instead of building a much bigger simulation of the workload, I made
preemption more expensive inside of schbench. In terms of performance he found:

EEVDF < CFS < CFS shared wake queue < sched_ext BPF

My runs with schbench match his percentage differences pretty closely.

The least complicated way I could find to penalize preemption is to use
a per-cpu spinlock around the matrix math. This can be disabled with
-L/--no-locking. The results map really well to our production
workloads, which don't use spinlocks, but do get hit with major page
faults when they lose the CPU in the middle of a request.
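
The shape of that locking is roughly the following. This is a sketch
under my own assumptions, not a copy of schbench's code; the helper
names and the fixed MAX_CPUS are made up:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define MAX_CPUS 1024

static pthread_spinlock_t cpu_locks[MAX_CPUS];
static int no_locking;	/* set by -L / --no-locking */

extern void do_matrix_work(unsigned long loops);	/* the loop above */

static void init_cpu_locks(void)
{
	for (int i = 0; i < MAX_CPUS; i++)
		pthread_spin_init(&cpu_locks[i], PTHREAD_PROCESS_PRIVATE);
}

static void locked_matrix_work(unsigned long loops)
{
	pthread_spinlock_t *lock = &cpu_locks[sched_getcpu() % MAX_CPUS];

	/*
	 * If the holder is preempted mid-request, the next worker
	 * scheduled onto this CPU spins here until the holder runs
	 * again, which is what makes involuntary preemption expensive.
	 */
	if (!no_locking)
		pthread_spin_lock(lock);

	do_matrix_work(loops);

	if (!no_locking)
		pthread_spin_unlock(lock);
}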

David has more schbench examples for his presentation at OSPM, but
here's some annotated output:

schbench -F128 -n 10
Wakeup Latencies percentiles (usec) runtime 90 (s) (370488 total samples)
50.0th: 9 (69381 samples)
90.0th: 24 (134753 samples)
* 99.0th: 1266 (32796 samples)
99.9th: 4712 (3322 samples)
min=1, max=12449

This is basically the important part of the original schbench. It's the
time from when a worker thread is woken to when it starts running.
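
Conceptually the measurement works like this (a sketch of the idea,
not schbench's exact code): the waker stamps the clock right before
waking a sleeping worker, and the worker stamps it again as the first
thing it does once it's running; the delta feeds the wakeup histogram.

#include <time.h>

struct wakeup_stamp {
	struct timespec woken_at;	/* written by the waker */
};

static long long ts_delta_usec(const struct timespec *a,
			       const struct timespec *b)
{
	return (long long)(b->tv_sec - a->tv_sec) * 1000000LL +
	       (b->tv_nsec - a->tv_nsec) / 1000;
}

/* waker side, immediately before the wakeup */
static void stamp_wakeup(struct wakeup_stamp *s)
{
	clock_gettime(CLOCK_MONOTONIC, &s->woken_at);
}

/* worker side, first thing after the sleep returns */
static long long wakeup_latency_usec(const struct wakeup_stamp *s)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return ts_delta_usec(&s->woken_at, &now);
}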

Request Latencies percentiles (usec) runtime 90 (s) (370983 total samples)
50.0th: 11440 (103738 samples)
90.0th: 12496 (120020 samples)
* 99.0th: 22304 (32498 samples)
99.9th: 26336 (3308 samples)
min=5818, max=57747

RPS percentiles (requests) runtime 90 (s) (9 total samples)
20.0th: 4312 (3 samples)
* 50.0th: 4376 (3 samples)
90.0th: 4440 (3 samples)
min=4290, max=4446

Request latency and RPS are both new. The original schbench had
requests, but they were based on wall clock spinning instead of a fixed
amount of CPU work. The new requests include two small usleep() calls
and the matrix math in their timing.
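
Putting the pieces together, one timed request looks roughly like the
following. The usleep() lengths are placeholders, and
locked_matrix_work() refers to the sketch above rather than schbench's
actual function names:

#include <time.h>
#include <unistd.h>

extern void locked_matrix_work(unsigned long loops);
extern void record_request_latency(long long usec);	/* feeds the histogram */

static long long elapsed_usec(const struct timespec *a,
			      const struct timespec *b)
{
	return (long long)(b->tv_sec - a->tv_sec) * 1000000LL +
	       (b->tv_nsec - a->tv_nsec) / 1000;
}

static void one_request(unsigned long loops)
{
	struct timespec start, end;

	clock_gettime(CLOCK_MONOTONIC, &start);

	usleep(100);			/* first small sleep */
	locked_matrix_work(loops);	/* fixed amount of CPU work */
	usleep(100);			/* second small sleep */

	clock_gettime(CLOCK_MONOTONIC, &end);
	record_request_latency(elapsed_usec(&start, &end));
}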

Generally for production the 99th percentile latencies are the most
important. For RPS, I watch the 20th and 50th percentiles more. The readme
linked above talks through the command line options and how to pick
good numbers.

I did some runs with different parameters comparing Linus git and EEVDF:

Comparing EEVDF (8c59a975d5ee) With Linus 6.3-rc6ish (a7a55e27ad72)

schbench -F128 -N <val> with and without -L
Single socket Intel Cooper Lake CPUs, turbo disabled

F128 N1 EEVDF Linus
Wakeup (usec): 99.0th: 355 555
Request (usec): 99.0th: 2,620 1,906
RPS (count): 50.0th: 37,696 41,664

F128 N1 no-locking EEVDF Linus
Wakeup (usec): 99.0th: 295 545
Request (usec): 99.0th: 1,890 1,758
RPS (count): 50.0th: 37,824 41,920

F128 N10 EEVDF Linus
Wakeup (usec): 99.0th: 755 1,266
Request (usec): 99.0th: 25,632 22,304
RPS (count): 50.0th: 4,280 4,376

F128 N10 no-locking EEVDF Linus
Wakeup (usec): 99.0th: 823 1,118
Request (usec): 99.0th: 17,184 14,192
RPS (count): 50.0th: 4,440 4,456

F128 N20 EEVDF Linus
Wakeup (usec): 99.0th: 901 1,806
Request (usec): 99.0th: 51,136 46,016
RPS (count): 50.0th: 2,132 2,196

F128 N20 no-locking EEVDF Linus
Wakeup (usec): 99.0th: 905 1,902
Request (usec): 99.0th: 32,832 30,496
RPS (count): 50.0th: 2,212 2,212

In general this shows us that EEVDF is a huge improvement on wakeup
latency, but we pay for it with preemptions during the request itself.
Diving into the F128 N10 no-locking numbers:

F128 N10 no-locking EEVDF Linus
Wakeup (usec): 99.0th: 823 1,118
Request (usec): 99.0th: 17,184 14,192
RPS (count): 50.0th: 4,440 4,456

EEVDF is very close in terms of RPS. The p99 request latency shows the
preemptions pretty well, but the p50 request latency numbers have EEVDF
winning slightly (11,376 usec on EEVDF vs 11,408 usec on Linus).

-chris