Re: [PATCH v5 7/7] sched/fair: Fair server interface

From: Joel Fernandes
Date: Thu Jan 18 2024 - 20:49:17 EST


On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
>
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>
> - fair_server_runtime: set runtime in ns
> - fair_server_period: set period in ns
> - fair_server_defer: on/off for the defer mechanism
>
> Signed-off-by: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>

Hi Daniel, Peter,
I am writing on behalf of the ChromeOS scheduler team.

We had to revert the last 3 patches in this series because of a syzkaller
reported bug, this happens on the sched/more branch in Peter's tree:

WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220
place_entity+0x240/0x290 kernel/sched/fair.c:5147
Call Trace:
<TASK>
enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283
enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717
enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117
activate_task+0x60/0xc0 kernel/sched/core.c:2147
ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794
ttwu_queue kernel/sched/core.c:4047 [inline]
try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368
kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142
__queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800
queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986
queue_delayed_work include/linux/workqueue.h:577 [inline]
srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline]

which is basically this warning in place_entity:
if (WARN_ON_ONCE(!load))
load = 1;

Full log (scroll to the bottom as there is console/lockdep side effects which
are likely not relevant to this issue): https://paste.debian.net/1304579/

Side note, we are also looking into a KASAN nullptr deref but this happens
only on our backport of the patches to a 5.15 kernel, as far as we know.

KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...]
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline]
RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445
[...]
Call Trace:
<TASK>
set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728
set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943
set_next_task kernel/sched/sched.h:2241 [inline]
pick_next_task kernel/sched/core.c:6014 [inline]
__schedule+0x36fb/0x402d kernel/sched/core.c:6378
preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590
preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615

Full splat: https://paste.debian.net/1304573/

Investigation is on going but could you also please take a look at these? It
is hard to reproduce and only syzkaller's syzbot has luck so for reproducing
these.

Also I had a comment below:

> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
> +{
> + u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
> + u64 new_bw = to_ratio(period, runtime);
> + struct rq *rq = dl_se->rq;
> + int cpu = cpu_of(rq);
> + struct dl_bw *dl_b;
> + unsigned long cap;
> + int retval = 0;
> + int cpus;
> +
> + dl_b = dl_bw_of(cpu);
> + raw_spin_lock(&dl_b->lock);
> + cpus = dl_bw_cpus(cpu);
> + cap = dl_bw_capacity(cpu);
> +
> + if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {

The dl_overflow() call here seems introducing an issue with our conceptual
understanding of how the dl server is supposed to work.

Suppose we have a 4 CPU system. Also suppose RT throttling is disabled.
Suppose the DL server params are 50ms runtime in 100ms period (basically we
want to dedicate 50% of the bandwidth of each CPU to CFS).

In such a situation, __dl_overflow() will return an error right? Because
total bandwidth will exceed 100% (4 times 50% is 200%).

Further, this complicates the setting of the parameters since it means we
have to check the number of CPUs in advance and then set the parameters to
prevent dl_overflow(). As an example of this, 30% (runtime / period) for each
CPU will work fine if we have 2 CPUs. But if we have 4 CPUs, it will not work
because __dl_overflow() will fail.

How do you suggest we remedy this? Can we make the dlserver calculate how
much bandwidth is allowed on a per-CPU basis? My understanding is the fake
dl_se are all pinned to their respective CPUs, so we don't have the same
requirement as real DL tasks which may migrate freely within the root domain.

thanks,

- Joel