Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

From: Mathieu Desnoyers
Date: Mon Feb 01 2010 - 12:45:27 EST


* Linus Torvalds (torvalds@xxxxxxxxxxxxxxxxxxxx) wrote:
>
>
> On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
> >
> > What we have to be careful about here is that it's not enough to just
> > rely on switch_mm() containing a memory barrier. What we really need to
> > enforce is that switch_mm() issues memory barriers both _before_ and
> > _after_ mm_cpumask modification. The "after" part is usually dealt with
> > by the TLB context switch, but the "before" part usually isn't.
>
> Ok, whatever. I vote for not doing anything at all, because this just all
> sounds like some totally crazy crap. You haven't explained the actual
> races, you haven't explained anything at all, you're apparently just
> randomly sprinkling smp_mb's around until you think it's all fine.
>
> Show the actual memory ordering constraints as it is related to the kernel
> data structures. I'm totally uninterested in general handwaving and "we
> must have smp_mb's here and here" without very explicit explanations of
> exactly WHAT the memory orderings are.

It's probably worthwhile to summarize here the mb-related discussions we
had in the previous RFC rounds.

Here is the detailed execution scenario showing the race. This is the
race with rcu_read_unlock(): it shows why user-space loads must be
ordered before the mm_cpumask store. The sequence below interleaves the
userspace RCU library with the Linux kernel running sys_membarrier().
As the trace shows, the lack of ordering between cpumask_clear() and
user-space execution creates a race with the scheduler in which
sys_membarrier() incorrectly concludes that CPU 1 does not need a
membarrier_ipi.


         CPU 0                            CPU 1
    --------------                   --------------
    <user space>                     <user space> (already in read-side C.S.)
                                     obj = rcu_dereference(list->next);
                                       -> load list->next
                                     copy = obj->foo;
                                     rcu_read_unlock();
                                       -> store to urcu_reader.ctr
                                     <urcu_reader.ctr store is globally visible>
    list_del(obj);
                                     <preempt>
                                     <kernel space>

                                     schedule();
                                     cpumask_clear(mm_cpumask, cpu);

    sys_membarrier();
    <kernel space>
    smp_mb()
    iterates on mm_cpumask
    smp_mb()
    <user space>
    set global g.p. (urcu_gp_ctr) phase to 1
    wait for all urcu_reader.ctr in phase 0
    set global g.p. (urcu_gp_ctr) phase to 0
    wait for all urcu_reader.ctr in phase 1
    sys_membarrier();
    <kernel space>
    smp_mb()
    iterates on mm_cpumask
    smp_mb()
    <user space>
    free(obj);
                                     <list->next load hits memory>
                                     <obj->foo load hits memory>  <- corruption


There is a similar race scenario (symmetric to the one above) between
cpumask_set and a following user-space rcu_read_lock(), which I could
provide upon request.
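
To make the required ordering concrete, here is a minimal illustrative
sketch of the constraint on the scheduler side. This is not the actual
patch: the helper names below are placeholders, the real mm_cpumask
updates live in switch_mm(), and patch 2/3 provides the equivalent
ordering through full barriers at runqueue lock/unlock around the task
switch rather than through explicit smp_mb() calls here.

    /*
     * Illustrative sketch only -- placeholder helpers, not the patch.
     */
    static inline void example_leave_mm_cpumask(struct mm_struct *prev_mm,
                                                int cpu)
    {
            /*
             * "Before" barrier: order the previous task's user-space
             * memory accesses before the mm_cpumask clear.  Without it,
             * the loads done in the read-side C.S. above can hit memory
             * after sys_membarrier() has already skipped this CPU.
             */
            smp_mb();
            cpumask_clear_cpu(cpu, mm_cpumask(prev_mm));
    }

    static inline void example_enter_mm_cpumask(struct mm_struct *next_mm,
                                                int cpu)
    {
            cpumask_set_cpu(cpu, mm_cpumask(next_mm));
            /*
             * "After" barrier: order the mm_cpumask set before the next
             * task's user-space memory accesses (the symmetric race with
             * rcu_read_lock()).  This side is usually provided by the
             * TLB/context switch code.
             */
            smp_mb();
    }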

This results in the following comments around smp_mb() in
sys_membarrier:

First sys_membarrier smp_mb():

/*
* Memory barrier on the caller thread before sending first
* IPI. Orders all memory accesses prior to sys_membarrier() before
* mm_cpumask read and membarrier_ipi executions.
* Matches memory barriers before and after mm_cpumask updates.
*/

Second sys_membarrier smp_mb():

/*
* Memory barrier on the caller thread after we finished
* waiting for the last IPI. Orders mm_cpumask read and membarrier_ipi
* executions before memory accesses following sys_membarrier()
* execution.
* Matches memory barriers before and after mm_cpumask updates.
*/
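
For context, here is a simplified sketch of where these two barriers sit
in sys_membarrier() relative to the mm_cpumask iteration and the IPIs.
This is condensed from the RFC patch under discussion: the
single-threaded fast path, the expedited/non-expedited distinction and
the preemption handling around the cpumask walk are omitted.

    static void membarrier_ipi(void *unused)
    {
            smp_mb();       /* order memory accesses on each targeted CPU */
    }

    SYSCALL_DEFINE0(membarrier)
    {
            /*
             * First smp_mb(): orders all caller accesses prior to
             * sys_membarrier() before the mm_cpumask read and the IPIs.
             */
            smp_mb();

            /* IPI every CPU currently running a thread of this mm. */
            smp_call_function_many(mm_cpumask(current->mm),
                                   membarrier_ipi, NULL, 1);

            /*
             * Second smp_mb(): orders the mm_cpumask read and the IPI
             * executions before all caller accesses following
             * sys_membarrier().
             */
            smp_mb();
            return 0;
    }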

Thanks,

Mathieu


>
> Linus

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68