Re: [PATCH] sched/rt: Avoid sending an IPI to a CPU already doing a push

From: Steven Rostedt
Date: Thu Jun 30 2016 - 13:57:29 EST



Gentle ping...

-- Steve


On Fri, 24 Jun 2016 11:26:13 -0400
Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:

> When a CPU lowers its priority (schedules out a high-priority task for a
> lower-priority one), a check is made to see if any other CPU has overloaded
> RT tasks (more than one runnable RT task). It checks the rto_mask to
> determine this, and if so, it will try to pull one of those tasks to itself,
> provided the non-running RT task has a higher priority than the new priority
> of the next task to run on the current CPU.
>
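> Roughly, that check looks like this (a simplified sketch, not the actual
> kernel code; pull_from() here is a stand-in for the real pull path):
>
>     for_each_cpu(cpu, rto_mask) {
>         struct rq *src_rq = cpu_rq(cpu);
>
>         /*
>          * Pull only if the overloaded CPU's waiting RT task has a higher
>          * priority (lower number) than what we are about to run.
>          */
>         if (src_rq->rt.highest_prio.next < this_rq->rt.highest_prio.curr)
>             pull_from(src_rq);
>     }
>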
> When we deal with a large number of CPUs, the original pull logic suffered
> from heavy lock contention on a single CPU's run queue, which caused a huge
> latency across all CPUs. This happened when only one CPU had overloaded RT
> tasks while a bunch of other CPUs were lowering their priority. To solve
> this issue, commit b6366f048e0c ("sched/rt: Use IPI to trigger RT task push
> migration instead of pulling") changed the way a pull is requested. Instead
> of grabbing the lock of the overloaded CPU's runqueue, it simply sent an IPI
> to that CPU to do the work.
>
> Although the IPI logic worked very well in removing the large latency build
> up, it could still suffer a large latency (although not as large as without
> the IPI) due to the work done within the IPI handler. To understand this
> issue, an explanation of the IPI logic is required.
>
> When a CPU lowers its priority, it finds the next set bit in the rto_mask
> starting from its own CPU. That is, if bits 2 and 10 are set in the
> rto_mask, and CPU 8 lowers its priority, it will select CPU 10 to send its
> IPI to. Now, let's say that CPUs 0 and 1 lower their priority. They will
> both send their IPI to CPU 2.
>
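> A sketch of that selection (simplified; the real code also uses push_cpu to
> detect when the scan has gone all the way around the mask):
>
>     /* Find the next overloaded CPU after us, wrapping around the mask */
>     cpu = cpumask_next(this_cpu, rto_mask);
>     if (cpu >= nr_cpu_ids)
>         cpu = cpumask_first(rto_mask);
>     if (cpu < nr_cpu_ids && cpu != this_cpu)
>         irq_work_queue_on(&rq->rt.push_work, cpu);  /* send the IPI */
>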
> If the IPI from CPU 0 gets to CPU 2 first, it triggers the push logic, and
> if CPU 1 is now running at a lower priority than CPU 0, CPU 2 will push its
> overloaded task to CPU 1 (due to cpupri), even though the IPI came from CPU
> 0. Even though a task was pushed, we need to make sure there are no higher
> priority tasks still waiting. Thus an IPI is then sent to CPU 10 to continue
> processing CPU 0's request (remember, the pushed task went to CPU 1).
>
> When the IPI from CPU 1 reaches CPU 2, CPU 2 will skip the push logic
> (because it no longer has any tasks to push), but it still needs to notify
> other CPUs about CPU 1 lowering its priority. Thus it sends another IPI to
> CPU 10, because that bit is still set in the rto_mask.
>
> Now CPU 10 has just finished dealing with the IPI from CPU 8, and even
> though it no longer has any RT tasks to push, it has just received two more
> IPIs (from CPU 2, to deal with CPU 0 and CPU 1). It too must do work to see
> if it should continue sending IPIs to more rto_mask CPUs. If there are no
> more CPUs to send to, it still needs to "stop" the execution of the push
> request.
>
> Although these IPIs are fast to process, I've traced a single CPU dealing
> with 89 IPIs in a row, on an 80 CPU machine! This was caused by an
> overloaded RT task that had a limited CPU affinity, so most of the CPUs
> sending IPIs to it couldn't do anything with it. And because the CPUs were
> very active and changed their priorities again, duplicates were sent out.
> The latency of handling the 89 IPIs was 200us (~2.3us to handle each IPI),
> as each IPI requires taking a spinlock that deals with the IPI itself (not
> a rq lock, and with very little contention).
>
> To solve this, an ipi_count is added to rt_rq that gets incremented when an
> IPI is sent to that runqueue and decremented when that runqueue's IPI work
> is done. When looking for the next CPU to process, the ipi_count is checked
> to see if that CPU is already processing push requests; if so, that CPU is
> skipped, and the next CPU in the rto_mask is checked.
>
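> Condensed, the protocol this patch adds looks like this (see the hunks
> below for the exact placement):
>
>     /* Sender: mark the target CPU as having a push IPI in flight */
>     atomic_inc(&cpu_rq(cpu)->rt.ipi_count);
>     irq_work_queue_on(&rq->rt.push_work, cpu);
>
>     /* Scanner: skip CPUs that are already processing a push IPI */
>     if (atomic_read(&next_rq->rt.ipi_count))
>         continue;
>
>     /* Receiver: drop the count once this CPU's push work is finished */
>     atomic_dec(&rq->rt.ipi_count);
>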
> The IPI code now needs to call push_rt_tasks() instead of just
> push_rt_task(), as it will not be receiving an IPI for each CPU that is
> requesting a pull.
>
> This change removes the duplication of work in the IPI logic, and greatly
> lowers the latency caused by the IPIs.
>
> Signed-off-by: Steven Rostedt <rostedt@xxxxxxxxxxx>
> ---
> kernel/sched/rt.c | 14 ++++++++++++--
> kernel/sched/sched.h | 2 ++
> 2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index d5690b722691..165bcfdbd94b 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -100,6 +100,7 @@ void init_rt_rq(struct rt_rq *rt_rq)
> rt_rq->push_flags = 0;
> rt_rq->push_cpu = nr_cpu_ids;
> raw_spin_lock_init(&rt_rq->push_lock);
> + atomic_set(&rt_rq->ipi_count, 0);
> init_irq_work(&rt_rq->push_work, push_irq_work_func);
> #endif
> #endif /* CONFIG_SMP */
> @@ -1917,6 +1918,10 @@ static int find_next_push_cpu(struct rq *rq)
> break;
> next_rq = cpu_rq(cpu);
>
> + /* If pushing was already started, ignore */
> + if (atomic_read(&next_rq->rt.ipi_count))
> + continue;
> +
> /* Make sure the next rq can push to this rq */
> if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
> break;
> @@ -1955,6 +1960,7 @@ static void tell_cpu_to_push(struct rq *rq)
> return;
>
> rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
> + atomic_inc(&cpu_rq(cpu)->rt.ipi_count);
>
> irq_work_queue_on(&rq->rt.push_work, cpu);
> }
> @@ -1974,11 +1980,12 @@ static void try_to_push_tasks(void *arg)
>
> rq = cpu_rq(this_cpu);
> src_rq = rq_of_rt_rq(rt_rq);
> + WARN_ON_ONCE(!atomic_read(&rq->rt.ipi_count));
>
> again:
> if (has_pushable_tasks(rq)) {
> raw_spin_lock(&rq->lock);
> - push_rt_task(rq);
> + push_rt_tasks(rq);
> raw_spin_unlock(&rq->lock);
> }
>
> @@ -2000,7 +2007,7 @@ again:
> raw_spin_unlock(&rt_rq->push_lock);
>
> if (cpu >= nr_cpu_ids)
> - return;
> + goto out;
>
> /*
> * It is possible that a restart caused this CPU to be
> @@ -2011,7 +2018,10 @@ again:
> goto again;
>
> /* Try the next RT overloaded CPU */
> + atomic_inc(&cpu_rq(cpu)->rt.ipi_count);
> irq_work_queue_on(&rt_rq->push_work, cpu);
> +out:
> + atomic_dec(&rq->rt.ipi_count);
> }
>
> static void push_irq_work_func(struct irq_work *work)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index de607e4febd9..b47d580dfa84 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -476,6 +476,8 @@ struct rt_rq {
> int push_cpu;
> struct irq_work push_work;
> raw_spinlock_t push_lock;
> + /* Used to skip CPUs being processed in the rto_mask */
> + atomic_t ipi_count;
> #endif
> #endif /* CONFIG_SMP */
> int rt_queued;