Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

From: Xiongfeng Wang
Date: Tue Aug 22 2023 - 04:59:13 EST


(+Cc other colleagues who are testing the modification Thomas gave)

Kindly ping

Does Thomas's modification look all right? I can help send the patch.
Other colleagues from my department are also running stress tests based on
this modification.

Thanks,
Xiongfeng

On 2023/6/29 16:30, Vincent Guittot wrote:
> On Thu, 29 Jun 2023 at 00:01, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>>
>> On Wed, Jun 28 2023 at 14:35, Vincent Guittot wrote:
>>> On Wed, 28 Jun 2023 at 14:03, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>>>> No, because this is fundamentally wrong.
>>>>
>>>> If the CPU is on the way out, then the scheduler hotplug machinery
>>>> has to handle the period timer so that the problem Xiongfeng analyzed
>>>> does not happen in the first place.
>>>
>>> But the hrtimer was enqueued before it starts to offline the cpu
>>
>> It does not really matter when it was enqueued. The important point is
>> that it was enqueued on that outgoing CPU for whatever reason.
>>
>>> Then, hrtimers_dead_cpu should take care of migrating the hrtimer out
>>> of the outgoing cpu but:
>>> - it must run on another target cpu to migrate the hrtimer.
>>> - it runs in the context of the caller which can be throttled.
>>
>> Sure. I completely understand the problem. The hrtimer hotplug callback
>> does not run because the task is stuck and waits for the timer to
>> expire. Circular dependency.
>>
>>>> sched_cpu_wait_empty() would be the obvious place to cleanup armed CFS
>>>> timers, but let me look into whether we can migrate hrtimers early in
>>>> general.
>>>
>>> but for that we must check if the timer is enqueued on the outgoing
>>> cpu and we then need to choose a target cpu.
>>
>> You're right. I somehow assumed that cfs knows where it queued stuff,
>> but obviously it does not.
>
> The scheduler doesn't know where hrtimer enqueues the timer.
>
>>
>> I think we can avoid all that by simply taking that user space task out
>> of the picture completely, which avoids debating whether there are other
>> possible weird conditions to consider altogether.
>
> yes, the offline sequence should not be impacted by the caller context
>
>>
>> Something like the untested below should just work.
>>
>> Thanks,
>>
>> tglx
>> ---
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -1490,6 +1490,13 @@ static int cpu_down(unsigned int cpu, en
>>  	return err;
>>  }
>>
>> +static long __cpu_device_down(void *arg)
>> +{
>> +	struct device *dev = arg;
>> +
>> +	return cpu_down(dev->id, CPUHP_OFFLINE);
>> +}
>> +
>>  /**
>>   * cpu_device_down - Bring down a cpu device
>>   * @dev: Pointer to the cpu device to offline
>> @@ -1502,7 +1509,12 @@ static int cpu_down(unsigned int cpu, en
>>   */
>>  int cpu_device_down(struct device *dev)
>>  {
>> -	return cpu_down(dev->id, CPUHP_OFFLINE);
>> +	unsigned int cpu = cpumask_any_but(cpu_online_mask, dev->id);
>> +
>> +	if (cpu >= nr_cpu_ids)
>> +		return -EBUSY;
>> +
>> +	return work_on_cpu(cpu, __cpu_device_down, dev);
>
> The comment for work_on_cpu :
>
> * It is up to the caller to ensure that the cpu doesn't go offline.
> * The caller must not hold any locks which would prevent @fn from completing.
>
> makes me wonder if this should be done only once the hotplug lock is
> taken, so that the selected cpu cannot go offline
>
>> }
>>
>> int remove_cpu(unsigned int cpu)
>
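
For readers following the proposed change, the CPU selection step in the
quoted cpu_device_down() can be modelled in user space. The sketch below is
only an illustration, not kernel code: the model_* names and
MODEL_NR_CPU_IDS are hypothetical, the online mask is a plain 64-bit
integer, and the lowest eligible CPU is chosen where the kernel helper may
return any. It does not address the concern quoted above about the selected
CPU going offline before the work runs.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for nr_cpu_ids; this model handles up to 64 CPUs. */
#define MODEL_NR_CPU_IDS 64u

/*
 * Model of cpumask_any_but(): return the lowest CPU set in online_mask
 * other than 'cpu', or MODEL_NR_CPU_IDS if there is none.
 */
static unsigned int model_cpumask_any_but(uint64_t online_mask,
					  unsigned int cpu)
{
	unsigned int i;

	for (i = 0; i < MODEL_NR_CPU_IDS; i++) {
		if (i != cpu && (online_mask & (UINT64_C(1) << i)))
			return i;
	}
	return MODEL_NR_CPU_IDS;
}

/*
 * Model of the selection step in the proposed cpu_device_down():
 * return the CPU that would run the offline work, or -EBUSY when the
 * outgoing CPU is the only one online.
 */
static int model_pick_offline_worker(uint64_t online_mask,
				     unsigned int outgoing)
{
	unsigned int cpu = model_cpumask_any_but(online_mask, outgoing);

	if (cpu >= MODEL_NR_CPU_IDS)
		return -EBUSY;
	return (int)cpu;
}
```

With CPUs 0 and 1 online, offlining CPU 0 selects CPU 1 as the worker, so
the offline work no longer runs in the context of the (possibly throttled)
caller. With only CPU 0 online, the model returns -EBUSY, matching the
early return in the quoted patch.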