Re: [patch 1/2] nohz: only wakeup a single target cpu when kicking a task

From: Frederic Weisbecker
Date: Wed Oct 14 2020 - 19:40:58 EST


On Wed, Oct 14, 2020 at 10:33:21AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 13, 2020 at 02:13:28PM -0300, Marcelo Tosatti wrote:
>
> > > Yes but if the task isn't running, run_posix_cpu_timers() doesn't have
> > > anything to elapse. So indeed we can spare the IPI if the task is not
> > > running. Provided ordering makes sure that the task sees the new dependency
> > > when it schedules in of course.
> >
> > True.
> >
> > * p->on_cpu <- { 0, 1 }:
> > *
> > * is set by prepare_task() and cleared by finish_task() such that it will be
> > * set before p is scheduled-in and cleared after p is scheduled-out, both
> > * under rq->lock. Non-zero indicates the task is running on its CPU.
> >
> >
> > CPU-0 (tick_set_dep) CPU-1 (task switch)
> >
> > STORE p->tick_dep_mask
> > smp_mb() (atomic_fetch_or())
> > LOAD p->on_cpu
> >
> >
> > context_switch(prev, next)
> > STORE next->on_cpu = 1
> > ... [*]
> >
> > LOAD current->tick_dep_mask
> >
>
> That load is in tick_nohz_task_switch() right? (which BTW is placed
> completely wrong) You could easily do something like the below I
> suppose.
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 81632cd5e3b7..2a5fafe66bb0 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -410,6 +410,14 @@ void __tick_nohz_task_switch(void)
> ts = this_cpu_ptr(&tick_cpu_sched);
>
> if (ts->tick_stopped) {
> + /*
> + * tick_set_dep() (this)
> + *
> + * STORE p->tick_dep_mask STORE p->on_cpu
> + * smp_mb() smp_mb()
> + * LOAD p->on_cpu LOAD p->tick_dep_mask
> + */
> + smp_mb();
> if (atomic_read(&current->tick_dep_mask) ||
> atomic_read(&current->signal->tick_dep_mask))
> tick_nohz_full_kick();

It would then need to be unconditional (whatever value of ts->tick_stopped).
Assuming the tick isn't stopped, we may well have an interrupt firing right
after schedule() which doesn't see the new value of tick_dep_map.

Alternatively, we could rely on p->on_rq which is set to TASK_ON_RQ_QUEUED
at wake up time, prior to the schedule() full barrier. Of course that doesn't
mean that the task is actually the one running on the CPU but it's a good sign,
considering that we are running in nohz_full mode and it's usually optimized
for single task mode.

Also setting a remote task's tick dependency is only used by posix cpu timer
in case the user has the bad taste to enqueue on a task running in nohz_full
mode. It shouldn't deserve an unconditional full barrier in the schedule path.

If the target is current, as is used by RCU, I guess we can keep a special
treatment.

> re tick_nohz_task_switch() being placed wrong, it should probably be
> placed before finish_lock_switch(). Something like so.
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index cf044580683c..5c92c959824f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4084,6 +4084,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> vtime_task_switch(prev);
> perf_event_task_sched_in(prev, current);
> finish_task(prev);
> + tick_nohz_task_switch();
> finish_lock_switch(rq);
> finish_arch_post_lock_switch();
> kcov_finish_switch(current);
> @@ -4121,7 +4122,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> put_task_struct_rcu_user(prev);
> }
>
> - tick_nohz_task_switch();

IIRC, we wanted to keep it outside rq lock because it shouldn't it...