Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

From: Vincent Guittot
Date: Thu Jun 29 2023 - 04:40:38 EST


On Thu, 29 Jun 2023 at 03:26, Xiongfeng Wang <wangxiongfeng2@xxxxxxxxxx> wrote:
>
>
>
> On 2023/6/28 0:46, Vincent Guittot wrote:
> > On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <wangxiongfeng2@xxxxxxxxxx> wrote:
> >>
> >> Hi,
> >>
> >> Kindly ping~
> >> Could you please take a look at this issue and the temporary fix below?
> >>
> >> Thanks,
> >> Xiongfeng
> >>
> >> On 2023/6/12 20:49, Xiongfeng Wang wrote:
> >>>
> >>>
> >>> On 2023/6/9 22:55, Thomas Gleixner wrote:
> >>>> On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:
> >>>>
> >>>> Cc+ scheduler people, leave context intact
> >>>>
> >>>>> Hello,
> >>>>> When I ran some low-power tests, the following hung task report was printed.

[...]

> >>> diff --cc kernel/sched/fair.c
> >>> index d9d6519fae01,bd6624353608..000000000000
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
> >>>   {
> >>>   	lockdep_assert_held(&cfs_b->lock);
> >>>
> >>> - 	if (cfs_b->period_active)
> >>> + 	if (cfs_b->period_active) {
> >>> + 		struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> >>> + 		int cpu = clock_base->cpu_base->cpu;
> >>> + 		if (!cpu_active(cpu) && cpu != smp_processor_id())
> >>> + 			hrtimer_start_expires(&cfs_b->period_timer,
> >>> + 					      HRTIMER_MODE_ABS_PINNED);
> >>>   		return;
> >>> + 	}
> >
> > I have been able to reproduce your problem and run your fix on top. I
> > still wonder if there is a
>
> Sorry, I forgot to provide the kernel modification that helps reproduce the issue.
> At first, the issue could only be reproduced in the product environment with
> the product stress testcase. After figuring out the cause, I added the following
> modification. It makes sure the process runs out of cfs quota and can be scheduled
> out in free_vm_stack_cache. Although the real schedule point is in __vunmap(), this
> also shows that the issue exists.

I have been able to reproduce the problem (or at least something
similar) without your change below, using a shorter cfs_quota_us and
other tasks always running in the cgroup.

>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0fb86b65ae60..3b2d83fb407a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -110,6 +110,8 @@
> #define CREATE_TRACE_POINTS
> #include <trace/events/task.h>
>
> +#include <linux/delay.h>
> +
> /*
> * Minimum number of threads to boot the kernel
> */
> @@ -199,6 +201,9 @@ static int free_vm_stack_cache(unsigned int cpu)
>  	struct vm_struct **cached_vm_stacks = per_cpu_ptr(cached_stacks, cpu);
>  	int i;
>
> +	mdelay(2000);
> +	cond_resched();
> +
>  	for (i = 0; i < NR_CACHED_STACKS; i++) {
>  		struct vm_struct *vm_stack = cached_vm_stacks[i];
>
> Thanks,
> Xiongfeng
>
> > Could we have a helper from hrtimer to get the cpu of the clock_base ?
> >
> >
> >>>
> >>>   	cfs_b->period_active = 1;
> >>> -
> >>>   	hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> >>>   	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> >>>   }
> >>>
> >>> Thanks,
> >>> Xiongfeng
> >>>
> >>> .
> >>>
> > .
> >
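
To make the helper question above concrete, here is a rough user-space
model of the check the quoted fair.c change performs. It is only a sketch:
the model_* structs and functions are made up for illustration and are not
existing kernel APIs. They mirror just the fields the patch reads
(period_timer.base->cpu_base->cpu), and the cpu_active array stands in for
cpu_active().

```c
/* User-space model only; none of these names exist in the kernel. */
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-ins for the hrtimer fields the quoted patch dereferences. */
struct model_cpu_base { int cpu; };
struct model_clock_base { struct model_cpu_base *cpu_base; };
struct model_hrtimer { struct model_clock_base *base; };

/* Hypothetical helper: return the CPU whose clock base the timer is attached to. */
static int model_hrtimer_base_cpu(const struct model_hrtimer *timer)
{
	assert(timer && timer->base && timer->base->cpu_base);
	return timer->base->cpu_base->cpu;
}

/*
 * Same condition as in the quoted patch: an already active period timer
 * should be re-armed when its base CPU is no longer active and is not the
 * CPU running this code. The caller must size cpu_active[] to cover the
 * timer's CPU number.
 */
static bool model_needs_rearm(const struct model_hrtimer *timer,
			      const bool *cpu_active, int this_cpu)
{
	int cpu = model_hrtimer_base_cpu(timer);

	return !cpu_active[cpu] && cpu != this_cpu;
}
```

With such a helper, start_cfs_bandwidth() would not need to look inside
the hrtimer clock base itself. Whether the helper should live in the
hrtimer code, and what it should be called, is left open here.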