Re: Possible lock-less list race in scheduler_ipi()

From: Paul E. McKenney
Date: Fri Mar 06 2015 - 12:10:11 EST


On Thu, Mar 05, 2015 at 11:48:51PM +0000, Mathieu Desnoyers wrote:
> Hi,
>
> I recently reviewed the scheduler_ipi() code by
> coincidence, and noticed that the use of llist.h
> in there seemed a bit odd. So I'd like to have
> more eyes to help me look into this.
>
> I've been told that there have been issues with
> IPIs lately, so here is one scenario that I
> think might be possible:
>
> - scheduler_ipi()
>   - sched_ttwu_pending()
>     - llist_del_all()
>       (grabs the wake_list with xchg())
>     - raw_spin_lock_irqsave(&rq->lock, flags);
>     - iteration on the llist (owned locally by the interrupt handler),
>       for each item:
>         p = llist_entry(llist, struct task_struct, wake_entry);
>         llist = llist_next(llist);
>         ttwu_do_activate(rq, p, 0);
>     - raw_spin_unlock_irqrestore(&rq->lock, flags);
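
For reference, sched_ttwu_pending() reads roughly as follows in current
mainline (kernel/sched/core.c); modulo version drift, this is the code
the scenario above walks through:

	void sched_ttwu_pending(void)
	{
		struct rq *rq = this_rq();
		struct llist_node *llist = llist_del_all(&rq->wake_list);
		struct task_struct *p;
		unsigned long flags;

		if (!llist)
			return;

		raw_spin_lock_irqsave(&rq->lock, flags);

		while (llist) {
			/* Pop one task off the privately owned list... */
			p = llist_entry(llist, struct task_struct, wake_entry);
			llist = llist_next(llist);
			/* ...and do the actual wakeup under rq->lock. */
			ttwu_do_activate(rq, p, 0);
		}

		raw_spin_unlock_irqrestore(&rq->lock, flags);
	}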
>
> ttwu_do_activate() ends up calling
> ttwu_activate() and ttwu_do_wakeup().
>
> I'm worried that ttwu_do_wakeup() might end up
> indirectly handing off the task to the following
> functions (within another execution context):
>
> - try_to_wake_up()
>   - ttwu_queue()
>     - ttwu_queue_remote() adds the process to another
>       wake_list with a cmpxchg()
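
For reference, ttwu_queue_remote() is roughly the following; the
cmpxchg() is buried inside llist_add() (more on that below):

	static void ttwu_queue_remote(struct task_struct *p, int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		/* Only kick the remote CPU if we added to an empty list;
		 * a nonempty list means an IPI is already on its way. */
		if (llist_add(&p->wake_entry, &rq->wake_list))
			smp_send_reschedule(cpu);
	}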
>
> llist_next() is pretty simple:
>
> static inline struct llist_node *llist_next(struct llist_node *node)
> {
>         return node->next;
> }
>
> It is so simple that I wonder if the compiler would be
> within its rights to reorder the load of node->next
> after some operations within ttwu_do_activate(), thus
> causing corruption of this linked-list due to a
> concurrent try_to_wake_up() performed by another core.

Well, it clearly cannot leak this load out of the critical section.
That said, there is nothing in llist_next() that tells the compiler
that it needs to be careful with this load.
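
If we did want to tell it, a minimal sketch would be to mark the load
explicitly; this is not what llist.h does today, just the sort of
annotation that would make the compiler's obligations explicit:

	static inline struct llist_node *llist_next(struct llist_node *node)
	{
		/* Force a single, unfused, unreordered load of ->next. */
		return ACCESS_ONCE(node->next);	/* READ_ONCE() nowadays */
	}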

And all three functions, sched_ttwu_pending(), ttwu_do_activate(),
and llist_next(), are in the same compilation unit. The compiler
therefore has full knowledge, and full ability to rearrange. It can
slide this load down past the CONFIG_SMP #ifdef in ttwu_do_activate()
and into ttwu_activate(). However, it cannot push it past the call to
sched_clock_cpu(), which is defined in the separate compilation unit
kernel/sched/clock.c. And sched_clock_cpu() is invoked from
update_rq_clock(), which is invoked from enqueue_task(), which is invoked
from activate_task(), which is invoked from ttwu_activate().
And even if sched_clock_cpu() were in the same compilation unit, its
preempt_disable_notrace() keeps the compiler from reordering. At least
it does if the compiler cannot prove that sched_clock_cpu() takes one
of two early exits. ;-)
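
For those following along, preempt_disable_notrace() expands to roughly
the following; the barrier() is the part that stops the compiler:

	#define preempt_disable_notrace() \
	do { \
		__preempt_count_inc(); \
		barrier(); /* compiler barrier: no reordering across this */ \
	} while (0)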

So I am not seeing anything important that the compiler could reorder
this past. The CPU might do additional reordering, of course.

> Am I too paranoid about the possible compiler mishaps
> there, or are my concerns justified?

Now try_to_wake_up() does acquire p->pi_lock, but that of course does
not exclude sched_ttwu_pending(), which instead acquires rq->lock.

And in a virtualized environment, sched_ttwu_pending() can be preempted
and delayed pretty much arbitrarily pretty much anywhere, even when its
interrupts are disabled.

So let's look at the llist code. One question is of course "How does
llist_add_batch() avoid the ABA problem?" In this case, the answer
appears to be that llist_del_all() is used, and never llist_del_first().
(Mixing use of llist_del_all() and llist_del_first() by multiple
threads could cause this ABA problem, even if only one thread at a time
used llist_del_first().)
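
To make that concrete, the two operations in play are roughly these
(lib/llist.c and include/linux/llist.h; llist_add() is a thin wrapper
around llist_add_batch()):

	bool llist_add_batch(struct llist_node *new_first,
			     struct llist_node *new_last,
			     struct llist_head *head)
	{
		struct llist_node *first;

		do {
			new_last->next = first = ACCESS_ONCE(head->first);
		} while (cmpxchg(&head->first, first, new_first) != first);

		return !first;	/* true if the list was previously empty */
	}

	static inline struct llist_node *llist_del_all(struct llist_head *head)
	{
		return xchg(&head->first, NULL);
	}

The xchg() unconditionally takes the whole list, so there is no stale
compare for a recycled node to satisfy; it is llist_del_first()'s
cmpxchg() against a node's old ->next that opens the ABA window.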

The code does appear to assume that if a given task is on the local
llist, that it cannot be awakened any other way without acquiring the
same runqueue lock that sched_ttwu_pending() holds. If this rule were
violated, then of course the list could be corrupted. However, the
only accesses to the list appear to be sched_ttwu_pending()'s
llist_del_all(), ttwu_queue_remote()'s llist_add(), and a couple
of llist_empty() calls. This combination appears to me to be safe.
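
(And llist_empty() is just a single marked load:

	static inline bool llist_empty(const struct llist_head *head)
	{
		return ACCESS_ONCE(head->first) == NULL;
	}

so it can of course return a stale answer, and its callers must be
prepared for that.)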

The scheduler_ipi() code assumes that an IPI cannot outrun the data
written before that IPI was sent. If this assumption is violated,
scheduler_ipi() might see an empty ->wake_list even though the sender
of the IPI just filled it.
The last time I worked with a machine that violated this assumption
was about twenty years ago, and I very much hope that such hardware
designs have not come back from the dead. But this would cause a
hang, not a corrupt list.
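
Trimmed to the lines that matter here, scheduler_ipi() is roughly:

	void scheduler_ipi(void)
	{
		...
		if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
			return;
		...
		irq_enter();
		sched_ttwu_pending();
		...
		irq_exit();
	}

If the IPI did outrun the list update, the llist_empty() check would
take the early return, leaving the queued wakeup sitting on the list
until something else kicked the CPU.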

So, does this trigger any helpful thoughts?

Thanx, Paul
