[RFC PATCH 41/86] sched: handle resched policy in resched_curr()

From: Ankur Arora
Date: Tue Nov 07 2023 - 17:04:08 EST


One of the last ports of call before rescheduling is triggered
is resched_curr().

Its task is to set TIF_NEED_RESCHED and, if running locally, either
fold it into the preempt_count, or send a resched-IPI so that the
target CPU folds it in.
To handle TIF_NEED_RESCHED_LAZY -- since the reschedule is not
imminent -- it only needs to set the appropriate bit.

Move all of the underlying mechanism into __resched_curr(), and define
resched_curr(), which handles the policy on when we want to set which
need-resched variant.
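
In sketch form, the split looks roughly as below. This is a standalone
toy model, not the kernel code: the printf() calls stand in for
set_preempt_need_resched() and the resched-IPI, and the enum mirrors
the resched_t type used elsewhere in the series.

  #include <stdio.h>

  typedef enum { RESCHED_lazy, RESCHED_eager } resched_t;

  /* Toy user-space model of __resched_curr(). */
  static void model_resched_curr(int target_cpu, int this_cpu, resched_t rs)
  {
          /* Both variants mark the target task. */
          printf("cpu%d: set %s\n", target_cpu,
                 rs == RESCHED_eager ? "TIF_NEED_RESCHED"
                                     : "TIF_NEED_RESCHED_LAZY");

          /* Lazy: nothing more to do; the bit is noticed at the next
           * tick or kernel entry/exit. */
          if (rs != RESCHED_eager)
                  return;

          /* Eager: make the need-resched visible right away. */
          if (target_cpu == this_cpu)
                  printf("fold into preempt_count\n");
          else
                  printf("send resched-IPI to cpu%d\n", target_cpu);
  }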

For now, the approach is to run to completion (TIF_NEED_RESCHED_LAZY),
with the following exceptions, where we always want to reschedule at
the next preemptible point (TIF_NEED_RESCHED); see the sketch after
this list:

- idle: if we are polling in idle, then set_nr_if_polling() will do
  the right thing. When not polling, we force TIF_NEED_RESCHED and
  send a resched-IPI if needed.

- the target CPU is in userspace: run to completion semantics are
  only for kernel tasks.

- running under the full preemption model.
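
Sketched as a standalone predicate (again a toy model, not the patch
itself; the integer parameters stand in for the CONFIG_PREEMPT,
idle_sched_class and context-tracking checks done below):

  typedef enum { RESCHED_lazy, RESCHED_eager } resched_t;

  /* Toy model of the resched_curr() policy decision. */
  static resched_t resched_policy(int full_preempt, int curr_is_idle,
                                  int in_user_or_guest)
  {
          /* Any of the exceptions above: reschedule at the next
           * preemptible point. */
          if (full_preempt || curr_is_idle || in_user_or_guest)
                  return RESCHED_eager;

          /* Otherwise (regardless of priority), run to completion. */
          return RESCHED_lazy;
  }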

Originally-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
---
kernel/sched/core.c | 80 +++++++++++++++++++++++++++++++++++++++------
1 file changed, 70 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01df5ac2982c..f65bf3ce0e9d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
 }
 
 /*
- * resched_curr - mark rq's current task 'to be rescheduled now'.
+ * __resched_curr - mark rq's current task 'to be rescheduled'.
  *
- * On UP this means the setting of the need_resched flag, on SMP it
- * might also involve a cross-CPU call to trigger the scheduler on
- * the target CPU.
+ * On UP this means the setting of the need_resched flag, on SMP, for
+ * eager resched it might also involve a cross-CPU call to trigger
+ * the scheduler on the target CPU.
  */
-void resched_curr(struct rq *rq)
+void __resched_curr(struct rq *rq, resched_t rs)
 {
 	struct task_struct *curr = rq->curr;
 	int cpu;
@@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
 	cpu = cpu_of(rq);
 
 	if (cpu == smp_processor_id()) {
-		set_tsk_need_resched(curr, RESCHED_eager);
-		set_preempt_need_resched();
+		set_tsk_need_resched(curr, rs);
+		if (rs == RESCHED_eager)
+			set_preempt_need_resched();
 		return;
 	}
 
-	if (set_nr_and_not_polling(curr, RESCHED_eager))
-		smp_send_reschedule(cpu);
-	else
+	if (set_nr_and_not_polling(curr, rs)) {
+		if (rs == RESCHED_eager)
+			smp_send_reschedule(cpu);
+	} else if (rs == RESCHED_eager)
 		trace_sched_wake_idle_without_ipi(cpu);
 }
 
+/*
+ * resched_curr - mark rq's current task 'to be rescheduled' eagerly
+ * or lazily according to the current policy.
+ *
+ * Always schedule eagerly, if:
+ *
+ *  - running under full preemption
+ *
+ *  - idle: when not polling (or if we don't have TIF_POLLING_NRFLAG)
+ *    force TIF_NEED_RESCHED to be set and send a resched IPI.
+ *    (the polling case has already set TIF_NEED_RESCHED via
+ *     set_nr_if_polling()).
+ *
+ *  - in userspace: run to completion semantics are only for kernel tasks
+ *
+ * Otherwise (regardless of priority), run to completion.
+ */
+void resched_curr(struct rq *rq)
+{
+	resched_t rs = RESCHED_lazy;
+	int context;
+
+	if (IS_ENABLED(CONFIG_PREEMPT) ||
+	    (rq->curr->sched_class == &idle_sched_class)) {
+		rs = RESCHED_eager;
+		goto resched;
+	}
+
+	/*
+	 * We might race with the target CPU while checking its ct_state:
+	 *
+	 * 1. The task might have just entered the kernel, but has not yet
+	 *    called user_exit(). We will see stale state (CONTEXT_USER) and
+	 *    send an unnecessary resched-IPI.
+	 *
+	 * 2. The user task is through with exit_to_user_mode_loop() but has
+	 *    not yet called user_enter().
+	 *
+	 *    We'll see the thread's state as CONTEXT_KERNEL and will try to
+	 *    schedule it lazily. There's obviously nothing that will handle
+	 *    this need-resched bit until the thread enters the kernel next.
+	 *
+	 *    The scheduler will still do tick accounting, but a potentially
+	 *    higher priority task waited to be scheduled for a user tick,
+	 *    instead of execution time in the kernel.
+	 */
+	context = ct_state_cpu(cpu_of(rq));
+	if ((context == CONTEXT_USER) ||
+	    (context == CONTEXT_GUEST)) {
+
+		rs = RESCHED_eager;
+		goto resched;
+	}
+
+resched:
+	__resched_curr(rq, rs);
+}
+
 void resched_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
--
2.31.1