Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

From: Vincent Guittot
Date: Wed Jun 28 2023 - 08:36:22 EST


On Wed, 28 Jun 2023 at 14:03, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> On Tue, Jun 27 2023 at 18:46, Vincent Guittot wrote:
> > On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <wangxiongfeng2@xxxxxxxxxx> wrote:
> >> > diff --cc kernel/sched/fair.c
> >> > index d9d6519fae01,bd6624353608..000000000000
> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
> >> > {
> >> > lockdep_assert_held(&cfs_b->lock);
> >> >
> >> > - if (cfs_b->period_active)
> >> > + if (cfs_b->period_active) {
> >> > + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> >> > + int cpu = clock_base->cpu_base->cpu;
> >> > + if (!cpu_active(cpu) && cpu != smp_processor_id())
> >> > + hrtimer_start_expires(&cfs_b->period_timer,
> >> > HRTIMER_MODE_ABS_PINNED);
> >> > return;
> >> > + }
> >
> > I have been able to reproduce your problem and run your fix on top. I
> > still wonder if there is a
> > Could we have a helper from hrtimer to get the cpu of the clock_base ?
>
> No, because this is fundamentally wrong.
>
> If the CPU is on the way out, then the scheduler hotplug machinery
> has to handle the period timer so that the problem Xiongfeng analyzed
> does not happen in the first place.

But the hrtimer was enqueued before the CPU started going offline.
hrtimers_dead_cpu() should then take care of migrating the hrtimer off
the outgoing CPU, but:
- it must run on another (target) CPU to migrate the hrtimer;
- it runs in the context of the caller, which can be throttled.
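
To make the two points above concrete, here is a toy userspace model of
the dependency. It is only a sketch, not kernel code: every name in it
(toy_state, toy_timer_can_fire, ...) is made up for illustration, and it
assumes the only way a throttled caller gets to run again is for the
period timer to fire.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, not kernel code. All names are hypothetical. */
struct toy_state {
	int  timer_cpu;          /* CPU whose queue holds the period timer */
	int  dead_cpu;           /* CPU that is being taken offline */
	bool migrator_throttled; /* is the task doing the migration throttled? */
};

/* A timer left on the dead CPU's queue never expires in this model. */
static bool toy_timer_can_fire(const struct toy_state *s)
{
	assert(s != NULL);
	return s->timer_cpu != s->dead_cpu;
}

/*
 * The migration runs in the caller's context. A throttled caller only
 * runs again once the period timer has fired and refilled the quota.
 */
static bool toy_migration_can_run(const struct toy_state *s)
{
	return !s->migrator_throttled || toy_timer_can_fire(s);
}

/* Stuck: the timer waits for the migration, the migration for the timer. */
static bool toy_is_stuck(const struct toy_state *s)
{
	return !toy_timer_can_fire(s) && !toy_migration_can_run(s);
}
```

In this model the only stuck state is "timer still on the dead CPU and
the migrating task throttled", which is the combination described above.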

>
> sched_cpu_wait_empty() would be the obvious place to cleanup armed CFS
> timers, but let me look into whether we can migrate hrtimers early in
> general.

But for that we must check whether the timer is enqueued on the
outgoing CPU, and we then need to choose a target CPU.
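
Roughly, the check and the target selection I have in mind look like the
toy userspace sketch below. It is not kernel code and not a proposed
patch: the types and helpers (toy_timer, toy_pick_target,
toy_migrate_early) are hypothetical, the "first other online CPU" policy
is just an assumption for the example, and all locking is ignored.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4
#define TOY_NO_CPU  (-1)

/* Toy stand-in for a timer: whether it is queued, and on which CPU. */
struct toy_timer {
	bool enqueued;
	int  cpu;
};

/* Pick the first online CPU other than the outgoing one (assumed policy). */
static int toy_pick_target(const bool online[TOY_NR_CPUS], int outgoing)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (cpu != outgoing && online[cpu])
			return cpu;
	}
	return TOY_NO_CPU;
}

/*
 * Move the timer only if it is queued on the outgoing CPU and a target
 * exists. Returns true if the timer was moved.
 */
static bool toy_migrate_early(struct toy_timer *t,
			      const bool online[TOY_NR_CPUS], int outgoing)
{
	int target;

	assert(outgoing >= 0 && outgoing < TOY_NR_CPUS);

	if (!t->enqueued || t->cpu != outgoing)
		return false;

	target = toy_pick_target(online, outgoing);
	if (target == TOY_NO_CPU)
		return false;

	t->cpu = target;
	return true;
}
```

The sketch leaves out the hard part, which is doing the "is it enqueued
on the outgoing CPU" check safely against a concurrently running
callback.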

>
> Aside of that the above is wrong by itself.
>
>     if (cfs_b->period_active)
>             hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>
> This only ends up on the outgoing CPU if either:
>
> 1) The code runs on the outgoing CPU
>
> or
>
> 2) The hrtimer is concurrently executing the hrtimer callback on the
> outgoing CPU.
>
> So this:
>
>     if (cfs_b->period_active) {
>             struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
>             int cpu = clock_base->cpu_base->cpu;
>
>             if (!cpu_active(cpu) && cpu != smp_processor_id())
>                     hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>             return;
>     }
>
> only works, if
>
> 1) The code runs _not_ on the outgoing CPU
>
> and
>
> 2) The hrtimer is _not_ concurrently executing the hrtimer callback on
> the outgoing CPU.
>
> If the callback is executing (it spins on cfs_b->lock), then the
> timer is requeued on the outgoing CPU. Not what you want, right?
>
> Plus accessing hrtimer->clock_base->cpu_base->cpu lockless is fragile at
> best.
>
> Thanks,
>
> tglx