Re: [PATCH -v2 07/17] sched: Fix hotplug vs CPU bandwidth control

From: Dietmar Eggemann
Date: Fri Oct 09 2020 - 16:41:19 EST


On 05/10/2020 16:57, Peter Zijlstra wrote:
> Since we now migrate tasks away before DYING, we should also move
> bandwidth unthrottle, otherwise we can gain tasks from unthrottle
> after we expect all tasks to be gone already.
>
> Also; it looks like the RT balancers don't respect cpu_active() and
> instead rely on rq->online in part, complete this. This too requires
> we do set_rq_offline() earlier to match the cpu_active() semantics.
> (The bigger patch is to convert RT to cpu_active() entirely)
>
> Since set_rq_online() is called from sched_cpu_activate(), place
> set_rq_offline() in sched_cpu_deactivate().

[ 76.215229] WARNING: CPU: 1 PID: 1913 at kernel/irq_work.c:95 irq_work_queue_on+0x108/0x110
[ 76.223589] Modules linked in:
[ 76.226646] CPU: 1 PID: 1913 Comm: task0-1 Not tainted 5.9.0-rc1-00159-g231df48234cb-dirty #40
[ 76.235268] Hardware name: ARM Juno development board (r0) (DT)
[ 76.241194] pstate: 60000085 (nZCv daIf -PAN -UAO BTYPE=--)
[ 76.246772] pc : irq_work_queue_on+0x108/0x110
[ 76.251220] lr : pull_rt_task+0x58/0x68
[ 76.255577] sp : ffff800013a83be0
[ 76.258890] x29: ffff800013a83be0 x28: ffff000972f34600
[ 76.264208] x27: ffff000972f34b90 x26: ffff8000114156c0
[ 76.269524] x25: 0080000000000000 x24: ffff800011cd7000
[ 76.274840] x23: ffff000972f34600 x22: ffff800010d627c8
[ 76.280157] x21: 0000000000000000 x20: 0000000000000000
[ 76.285473] x19: ffff00097ef701c0 x18: 0000000000000010
[ 76.290788] x17: 0000000000000000 x16: 0000000000000000
[ 76.296104] x15: ffff000972f34a80 x14: ffffffffffffffff
[ 76.301420] x13: ffff800093a83987 x12: ffff800013a8398f
[ 76.306736] x11: ffff800011ac2000 x10: ffff800011ce8690
[ 76.312051] x9 : 0000000000000000 x8 : ffff800011ce9000
[ 76.317367] x7 : ffff8000106e9bb8 x6 : 0000000000000144
[ 76.322682] x5 : 0000000000000000 x4 : ffff800011aaa1c0
[ 76.327998] x3 : 0000000000000000 x2 : 0000000000000000
[ 76.333314] x1 : ffff800011ce72a0 x0 : 0000000000000002
[ 76.338630] Call trace:
[ 76.341076] irq_work_queue_on+0x108/0x110
[ 76.349185] pull_rt_task+0x58/0x68
[ 76.352673] balance_rt+0x84/0x88
[ 76.355990] __schedule+0x4a4/0x670
[ 76.359478] schedule+0x70/0x108
[ 76.362706] do_nanosleep+0x8c/0x178
[ 76.366283] hrtimer_nanosleep+0xa0/0x118
[ 76.370294] common_nsleep_timens+0x68/0x98
[ 76.374479] __arm64_sys_clock_nanosleep+0xc0/0x138
[ 76.379361] el0_svc_common.constprop.0+0x6c/0x168
[ 76.384155] do_el0_svc+0x24/0x90
[ 76.387471] el0_sync_handler+0x90/0x198
[ 76.391394] el0_sync+0x158/0x180


balance_rt() checks via need_pull_rt_task() that the rq is online, but
with RT_PUSH_IPI, pull_rt_task() -> tell_cpu_to_push() calls
irq_work_queue_on() with cpu = rto_next_cpu(rq->rd), and it looks like
that CPU can now be offline here as well.
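
To make the gap concrete, here is a minimal userspace sketch in C. It is
not kernel code and not a proposed fix: the arrays and the *_model /
push_target_* names are hypothetical stand-ins for cpu_online(), the
root domain's overload mask and rto_next_cpu(). It only shows that a
target taken from the overload mask alone can be an offline CPU, while a
selection that also consults the online state skips it.

```c
#include <stdbool.h>

#define NR_CPUS_MODEL 4

/* Hypothetical stand-in for cpu_online(): CPU 2 has gone offline. */
static bool cpu_online_model[NR_CPUS_MODEL] = { true, true, false, true };

/* Hypothetical stand-in for the RT overload mask: CPU 2 is still set. */
static bool rto_mask_model[NR_CPUS_MODEL] = { false, false, true, true };

/*
 * Toy analogue of rto_next_cpu(): return the first CPU after 'prev'
 * that is set in the overload mask, or -1 if there is none.
 * It does not look at the online state at all.
 */
static int rto_next_cpu_model(int prev)
{
	for (int cpu = prev + 1; cpu < NR_CPUS_MODEL; cpu++) {
		if (rto_mask_model[cpu])
			return cpu;
	}
	return -1;
}

/* Mirrors the reported path: the first overloaded CPU is used as-is. */
static int push_target_unchecked(void)
{
	return rto_next_cpu_model(-1);
}

/* Assumed alternative: skip overloaded CPUs that are no longer online. */
static int push_target_checked(void)
{
	int cpu = -1;

	while ((cpu = rto_next_cpu_model(cpu)) >= 0) {
		if (cpu_online_model[cpu])
			return cpu;
	}
	return -1;
}
```

With the initial state above, push_target_unchecked() picks CPU 2, which
is offline in this model, whereas push_target_checked() picks CPU 3.
Whether the real code should filter at this point or clear the mask
earlier during hotplug is a separate question this sketch does not
answer.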