Re: [RFC PATCH 41/86] sched: handle resched policy in resched_curr()

From: Ankur Arora
Date: Wed Nov 08 2023 - 05:28:15 EST



Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:

> On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
>> }
>>
>> /*
>> - * resched_curr - mark rq's current task 'to be rescheduled now'.
>> + * __resched_curr - mark rq's current task 'to be rescheduled'.
>> *
>> - * On UP this means the setting of the need_resched flag, on SMP it
>> - * might also involve a cross-CPU call to trigger the scheduler on
>> - * the target CPU.
>> + * On UP this means the setting of the need_resched flag, on SMP, for
>> + * eager resched it might also involve a cross-CPU call to trigger
>> + * the scheduler on the target CPU.
>> */
>> -void resched_curr(struct rq *rq)
>> +void __resched_curr(struct rq *rq, resched_t rs)
>> {
>> struct task_struct *curr = rq->curr;
>> int cpu;
>> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
>> cpu = cpu_of(rq);
>>
>> if (cpu == smp_processor_id()) {
>> - set_tsk_need_resched(curr, RESCHED_eager);
>> - set_preempt_need_resched();
>> + set_tsk_need_resched(curr, rs);
>> + if (rs == RESCHED_eager)
>> + set_preempt_need_resched();
>> return;
>> }
>>
>> - if (set_nr_and_not_polling(curr, RESCHED_eager))
>> - smp_send_reschedule(cpu);
>> - else
>> + if (set_nr_and_not_polling(curr, rs)) {
>> + if (rs == RESCHED_eager)
>> + smp_send_reschedule(cpu);
>
> I think you just broke things.
>
> Not all idle threads have POLLING support, in which case you need that
> IPI to wake them up, even if it's LAZY.

Yes, I was concerned about that too. But doesn't the check against
idle_sched_class in resched_curr() cover that?

>> + if (IS_ENABLED(CONFIG_PREEMPT) ||
>> + (rq->curr->sched_class == &idle_sched_class)) {
>> + rs = RESCHED_eager;
>> + goto resched;

>> + } else if (rs == RESCHED_eager)
>> trace_sched_wake_idle_without_ipi(cpu);
>> }
>
>
>
>>
>> +/*
>> + * resched_curr - mark rq's current task 'to be rescheduled' eagerly
>> + * or lazily according to the current policy.
>> + *
>> + * Always schedule eagerly, if:
>> + *
>> + * - running under full preemption
>> + *
>> + * - idle: when not polling (or if we don't have TIF_POLLING_NRFLAG)
>> + * force TIF_NEED_RESCHED to be set and send a resched IPI.
>> + * (the polling case has already set TIF_NEED_RESCHED via
>> + * set_nr_if_polling()).
>> + *
>> + * - in userspace: run to completion semantics are only for kernel tasks
>> + *
>> + * Otherwise (regardless of priority), run to completion.
>> + */
>> +void resched_curr(struct rq *rq)
>> +{
>> + resched_t rs = RESCHED_lazy;
>> + int context;
>> +
>> + if (IS_ENABLED(CONFIG_PREEMPT) ||
>> + (rq->curr->sched_class == &idle_sched_class)) {
>> + rs = RESCHED_eager;
>> + goto resched;
>> + }
>> +
>> + /*
>> + * We might race with the target CPU while checking its ct_state:
>> + *
>> + * 1. The task might have just entered the kernel, but has not yet
>> + * called user_exit(). We will see stale state (CONTEXT_USER) and
>> + * send an unnecessary resched-IPI.
>> + *
>> + * 2. The user task is through with exit_to_user_mode_loop() but has
>> + * not yet called user_enter().
>> + *
>> + * We'll see the thread's state as CONTEXT_KERNEL and will try to
>> + * schedule it lazily. There's obviously nothing that will handle
>> + * this need-resched bit until the thread enters the kernel next.
>> + *
>> + * The scheduler will still do tick accounting, but a potentially
>> + * higher priority task waited to be scheduled for a user tick,
>> + * instead of execution time in the kernel.
>> + */
>> + context = ct_state_cpu(cpu_of(rq));
>> + if ((context == CONTEXT_USER) ||
>> + (context == CONTEXT_GUEST)) {
>> +
>> + rs = RESCHED_eager;
>> + goto resched;
>> + }
>
> Like said, this simply cannot be. You must not rely on the remote CPU
> being in some state or not. Also, it's racy, you could observe USER and
> then it enters KERNEL.

Or worse. We might observe KERNEL and it enters USER.

I think we would be fine if we observe USER: we would just upgrade to
RESCHED_eager and send an unnecessary IPI.

But if we observe KERNEL and it enters USER, then we will have set the
need-resched-lazy bit, which the thread might not see until its next
entry into the kernel (it might already be past
exit_to_user_mode_loop()).

But yes, I would like to avoid relying on ct_state as well. The trouble
is that need-resched-lazy only makes sense while the task on the
runqueue is executing in the kernel...

--
ankur