Re: [PATCH v5 7/7] sched/fair: Fair server interface

From: Joel Fernandes
Date: Tue Jan 23 2024 - 10:47:52 EST


Hi Daniel,

On 1/22/2024 9:14 AM, Daniel Bristot de Oliveira wrote:
> On 1/19/24 02:55, Joel Fernandes wrote:
>> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>>> Add an interface for fair server setup on debugfs.
>>>
>>> Each rq has three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>
>>> - fair_server_runtime: set runtime in ns
>>> - fair_server_period: set period in ns
>>> - fair_server_defer: on/off for the defer mechanism
>>>
>>> Signed-off-by: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
>>
>> Hi Daniel, Peter,
>> I am writing on behalf of the ChromeOS scheduler team.
>>
>> We had to revert the last 3 patches in this series because of a
>> syzkaller-reported bug; this happens on the sched/more branch in Peter's tree:
>>
>> WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220
>> place_entity+0x240/0x290 kernel/sched/fair.c:5147
>> Call Trace:
>> <TASK>
>> enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283
>> enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717
>> enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117
>> activate_task+0x60/0xc0 kernel/sched/core.c:2147
>> ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794
>> ttwu_queue kernel/sched/core.c:4047 [inline]
>> try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368
>> kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142
>> __queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800
>> queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986
>> queue_delayed_work include/linux/workqueue.h:577 [inline]
>> srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline]
>>
>> which is basically this warning in place_entity:
>>         if (WARN_ON_ONCE(!load))
>>                 load = 1;
>>
>> Full log (scroll to the bottom, as there are console/lockdep side effects which
>> are likely not relevant to this issue): https://paste.debian.net/1304579/
>>
>> Side note: we are also looking into a KASAN null-ptr-deref, but as far as we
>> know this happens only on our backport of the patches to a 5.15 kernel.
>>
>> KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
>> CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...]
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
>> RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline]
>> RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445
>> [...]
>> Call Trace:
>> <TASK>
>> set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728
>> set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943
>> set_next_task kernel/sched/sched.h:2241 [inline]
>> pick_next_task kernel/sched/core.c:6014 [inline]
>> __schedule+0x36fb/0x402d kernel/sched/core.c:6378
>> preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590
>> preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615
>>
>> Full splat: https://paste.debian.net/1304573/
>
> Interesting, does it leave any task hung? I am seeing a case where I get
> a hung task, but I do not get the splat because the system freezes (printk
> under rq_lock, I guess)...
>
> It might be the same problem.

Ah, we have an update: Suleiman found that this is happening because of core
scheduling's pick logic. I have some patches to fix it, and we have more fixes
for other issues as well. We are still testing them and will coordinate with
the team to send them out soon.
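
For anyone following along, the WARN we are hitting is in the lag-preserving
placement path of place_entity(). Roughly, paraphrasing the PLACE_LAG block
from mainline fair.c (exact line numbers and details differ between trees):

        if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
                struct sched_entity *curr = cfs_rq->curr;
                unsigned long load;

                lag = se->vlag;

                /*
                 * Compensate the lag for the effect the entity's own weight
                 * has on the queue's weighted average.
                 */
                load = cfs_rq->avg_load;
                if (curr && curr->on_rq)
                        load += scale_load_down(curr->load.weight);

                lag *= load + scale_load_down(se->load.weight);
                if (WARN_ON_ONCE(!load))
                        load = 1;
                lag = div_s64(lag, load);
        }

So the warning fires when we get here with cfs_rq->nr_running > 0 but a zero
avg_load and no on_rq current entity, which is what points at the pick path.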

>> Investigation is ongoing, but could you also please take a look at these? They
>> are hard to reproduce, and only syzbot has had luck reproducing them.
>>
>> Also I had a comment below:
>>
>>> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
>>> +{
>>> +        u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
>>> +        u64 new_bw = to_ratio(period, runtime);
>>> +        struct rq *rq = dl_se->rq;
>>> +        int cpu = cpu_of(rq);
>>> +        struct dl_bw *dl_b;
>>> +        unsigned long cap;
>>> +        int retval = 0;
>>> +        int cpus;
>>> +
>>> +        dl_b = dl_bw_of(cpu);
>>> +        raw_spin_lock(&dl_b->lock);
>>> +        cpus = dl_bw_cpus(cpu);
>>> +        cap = dl_bw_capacity(cpu);
>>> +
>>> +        if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
>>
>> The __dl_overflow() call here seems to introduce an issue with our conceptual
>> understanding of how the dl server is supposed to work.
>>
>> Suppose we have a 4 CPU system. Also suppose RT throttling is disabled.
>> Suppose the DL server params are 50ms runtime in 100ms period (basically we
>> want to dedicate 50% of the bandwidth of each CPU to CFS).
>>
>> In such a situation, __dl_overflow() will return an error, right? Because the
>> total bandwidth will exceed 100% (4 times 50% is 200%).
>
> I might be missing something in your case, but it accepts:
>
> root@fedora:/sys/kernel/debug/sched/fair_server# find . -type f -exec cat {} \;
> 1
> 1000000000
> 500000000
> 1
> 1000000000
> 500000000
> 1
> 1000000000
> 500000000
> 1
> 1000000000
> 500000000
>
> your system accepts 400%... the percentage is "global".
>
> is it failing in your system?

You are right; I was actually trying to change it manually in my kernel, in
dl_server_start(). In that case dl_server_apply_params() gets init=1 and
old_bw is 0.
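
For the archives, the part I had missed is that the admission test is done
against the whole root domain, not a single CPU. Roughly, paraphrasing
__dl_overflow() from kernel/sched/sched.h (the exact form may differ by tree):

        static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
                                         u64 old_bw, u64 new_bw)
        {
                return dl_b->bw != -1 &&
                       cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
        }

Here cap is the summed capacity of all CPUs in the root domain, so in the
4-CPU example above the four 50% servers add up to 200% against a limit of
roughly 4 * 95% with the default sched_rt_runtime_us; and if RT throttling is
disabled, dl_b->bw is -1 and the check is skipped entirely.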

I tried using the debugfs interface, and that works, so I think we will just
use debugfs. I was being lazy and setting it manually in my kernel for testing,
like this:

@@ -1475,7 +1475,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
         * this before getting generic.
         */
        if (!dl_server(dl_se)) {
                u64 runtime = 12 * NSEC_PER_MSEC;
                u64 period = 15 * NSEC_PER_MSEC;

Setting it in the kernel like that doesn't work, but it does work via debugfs.
For production we will set it from userspace anyway, so this should not be an
issue.
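
In case it is useful to anyone else, here is a minimal sketch of what "set it
from userspace" could look like, using the debugfs files from the changelog
(/sys/kernel/debug/sched/rq/CPU{ID}/fair_server_*). The exact directory name
and the 50ms/100ms values below are only illustrative:

        #include <stdio.h>

        /* Write a single integer value into a debugfs file. */
        static int write_u64(const char *path, unsigned long long val)
        {
                FILE *f = fopen(path, "w");

                if (!f)
                        return -1;
                fprintf(f, "%llu\n", val);
                return fclose(f);
        }

        int main(void)
        {
                /* Illustrative: 50ms runtime out of a 100ms period on CPU 0. */
                write_u64("/sys/kernel/debug/sched/rq/cpu0/fair_server_period",
                          100 * 1000 * 1000ULL);
                write_u64("/sys/kernel/debug/sched/rq/cpu0/fair_server_runtime",
                          50 * 1000 * 1000ULL);
                return 0;
        }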

I feel so much better now :) Thanks Daniel.

By the way, what is the plan for the remaining patches in the sched/more
branch? Are you planning to resend them later? If so, we can just post our
fixes on top of that, and if you don't mind, you could include them in your
next series posting (sched/more + our fixes + your fixes).

Thanks!

- Joel