Re: linux-next: manual merge of the rcu tree with the tip tree

From: Mathieu Desnoyers
Date: Tue Aug 01 2017 - 10:02:42 EST


----- On Aug 1, 2017, at 9:43 AM, Andy Lutomirski luto@xxxxxxxxxx wrote:

> On Mon, Jul 31, 2017 at 9:03 PM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>> On Tue, Aug 01, 2017 at 12:04:05AM +0000, Mathieu Desnoyers wrote:
>>> ----- On Jul 31, 2017, at 12:13 PM, Paul E. McKenney paulmck@xxxxxxxxxxxxxxxxxx
>>> wrote:
>>>
>
>> Thanx, Paul
>>
>> ------------------------------------------------------------------------
>>
>> commit fde19879b6bd1abc0c1d4d5f945efed61bf7eb8c
>> Author: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
>> Date: Fri Jul 28 16:40:40 2017 -0400
>>
>> membarrier: Expedited private command
>>
>> Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
>> from all runqueues for which current thread's mm is the same as the
>> thread calling sys_membarrier. It executes faster than the non-expedited
>> variant (no blocking). It also works on NOHZ_FULL configurations.
>>
>> Scheduler-wise, it requires a memory barrier before and after context
>> switching between processes (which have different mm). The memory
>> barrier before context switch is already present. For the barrier after
>> context switch:
>>
>> * Our TSO archs can do RELEASE without being a full barrier. Look at
>> x86 spin_unlock() being a regular STORE for example. But for those
>> archs, all atomics imply smp_mb and all of them have atomic ops in
>> switch_mm() for mm_cpumask().
>
> I think that, on x86, context switches, even without mm changes, must
> at least flush the store buffer (maybe SFENCE is okay) to avoid
> visible inconsistency due to store-buffer forwarding.
>
> Anyway, can you document whatever property you require with a comment
> in switch_mm() or wherever you're finding that property so that future
> arch changes don't break it?

As I asked Paul in my reply to his proposed manual merge, we should
indeed have a comment in switch_mm(), just before the line invoking
cpumask_set_cpu(), stating something like this:

	/*
	 * The full memory barrier implied by mm_cpumask update operations
	 * is required by the membarrier system call.
	 */

What we want to order here is:

prev userspace memory accesses
schedule
<full mb> (it's already there) [A]
update to rq->curr changing the rq->curr->mm value
<full mb> (provided by mm_cpumask updates in switch_mm on x86) [B]
next userspace memory accesses

with respect to:

userspace memory accesses
sys_membarrier
<full mb> [C]
iterate over each CPU's rq->curr, comparing its "mm" to current->mm
IPI each CPU that matches
<full mb> [D]
userspace memory accesses

[A] pairs with [D] and [B] pairs with [C].
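
To make the [B]/[C] pairing concrete, here is a minimal userspace sketch of
it as a store-buffering litmus test. It is only an analogy, not kernel code:
C11 atomic_thread_fence(memory_order_seq_cst) stands in for smp_mb(), two
pthreads stand in for the scheduler and for the sys_membarrier caller, and
every name (rq_curr_mm, user_data, count_forbidden_outcomes) is made up for
illustration. The outcome the barriers must forbid is that the caller misses
the rq->curr update (so it sends no IPI) while the next task's userspace load
also misses the caller's earlier store.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins: 1 means "rq->curr runs our mm" / "data stored". */
static atomic_int rq_curr_mm;
static atomic_int user_data;
static int next_task_saw_data;	/* written by scheduler_side() only */
static int caller_saw_new_curr;	/* written by membarrier_side() only */

/* Models: update rq->curr, <full mb> [B], next userspace memory access. */
static void *scheduler_side(void *arg)
{
	(void)arg;
	atomic_store_explicit(&rq_curr_mm, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* [B] */
	next_task_saw_data =
		atomic_load_explicit(&user_data, memory_order_relaxed);
	return NULL;
}

/* Models: userspace memory access, <full mb> [C], read of rq->curr. */
static void *membarrier_side(void *arg)
{
	(void)arg;
	atomic_store_explicit(&user_data, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* [C] */
	caller_saw_new_curr =
		atomic_load_explicit(&rq_curr_mm, memory_order_relaxed);
	return NULL;
}

/*
 * Runs the litmus test 'iterations' times and returns how many runs ended
 * with both loads reading 0, or -1 if a thread could not be created.
 */
int count_forbidden_outcomes(int iterations)
{
	int forbidden = 0;

	assert(iterations >= 0);
	for (int i = 0; i < iterations; i++) {
		pthread_t sched, caller;

		atomic_store(&rq_curr_mm, 0);
		atomic_store(&user_data, 0);
		if (pthread_create(&sched, NULL, scheduler_side, NULL) != 0)
			return -1;
		if (pthread_create(&caller, NULL, membarrier_side, NULL) != 0) {
			pthread_join(sched, NULL);
			return -1;
		}
		pthread_join(sched, NULL);
		pthread_join(caller, NULL);
		if (!next_task_saw_data && !caller_saw_new_curr)
			forbidden++;
	}
	return forbidden;
}
```

With both fences in place, count_forbidden_outcomes() should return 0 for any
iteration count. Removing either fence makes the "both loads read 0" outcome
permitted by the C11 memory model, which is the userspace equivalent of
losing [B] or [C].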


>
>> +static void membarrier_private_expedited(void)
>> +{
>> +	int cpu;
>> +	bool fallback = false;
>> +	cpumask_var_t tmpmask;
>> +
>> +	if (num_online_cpus() == 1)
>> +		return;
>> +
>> +	/*
>> +	 * Matches memory barriers around rq->curr modification in
>> +	 * scheduler.
>> +	 */
>> +	smp_mb();	/* system call entry is not a mb. */
>> +
>> +	/*
>> +	 * Expedited membarrier commands guarantee that they won't
>> +	 * block, hence the GFP_NOWAIT allocation flag and fallback
>> +	 * implementation.
>> +	 */
>> +	if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
>> +		/* Fallback for OOM. */
>> +		fallback = true;
>> +	}
>> +
>> +	cpus_read_lock();
>> +	for_each_online_cpu(cpu) {
>> +		struct task_struct *p;
>> +
>> +		/*
>> +		 * Skipping the current CPU is OK even though we can be
>> +		 * migrated at any point. The current CPU, at the point
>> +		 * where we read raw_smp_processor_id(), is ensured to
>> +		 * be in program order with respect to the caller
>> +		 * thread. Therefore, we can skip this CPU from the
>> +		 * iteration.
>> +		 */
>> +		if (cpu == raw_smp_processor_id())
>> +			continue;
>> +		rcu_read_lock();
>> +		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> +		if (p && p->mm == current->mm) {
>
> I'm a bit surprised you're iterating all CPUs instead of just CPUs in
> mm_cpumask().

I see two reasons for this. The first is that architectures like
ARM64 don't even bother populating the mm_cpumask. The second is
that I don't think all architectures ensure that updates to
mm_cpumask imply full memory barriers. Therefore, if we chose to use
this bitmask as an optimization, we would need to revisit each
architecture's switch_mm() to ensure that the mm_cpumask bit set ops
either imply a full memory barrier or are followed by an explicit one.
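
To illustrate the first reason, here is a small userspace model (not kernel
code) comparing the two ways of picking IPI targets. struct model_mm,
MODEL_NR_CPUS and both select_* helpers are hypothetical: a 32-bit word
stands in for a cpumask, and an array of pointers stands in for each CPU's
rq->curr->mm. With a mask that was never populated, the mask-based variant
selects nothing even though the mm is running on other CPUs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_CPUS 8	/* hypothetical CPU count for the model */

/* Hypothetical stand-in for mm_struct; bit N of cpumask means CPU N. */
struct model_mm {
	uint32_t cpumask;
};

/*
 * Scan every CPU's current mm, as the patch does with rq->curr, and
 * return a bitmask of the CPUs to IPI. The caller's own CPU is skipped.
 */
uint32_t select_by_rq_scan(const struct model_mm *curr_mm[],
			   const struct model_mm *mm, int this_cpu)
{
	uint32_t targets = 0;

	assert(this_cpu >= 0 && this_cpu < MODEL_NR_CPUS);
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (cpu == this_cpu)
			continue;
		if (curr_mm[cpu] != NULL && curr_mm[cpu] == mm)
			targets |= UINT32_C(1) << cpu;
	}
	return targets;
}

/*
 * Trust the per-mm mask instead. This is only as good as the
 * architecture's bookkeeping: an unpopulated mask yields no targets.
 */
uint32_t select_by_mm_cpumask(const struct model_mm *mm, int this_cpu)
{
	assert(this_cpu >= 0 && this_cpu < MODEL_NR_CPUS);
	return mm->cpumask & ~(UINT32_C(1) << this_cpu);
}
```

For example, if the mm is current on CPUs 0, 2, 4 and 7 and the caller runs
on CPU 0, select_by_rq_scan() returns 0x94, whereas select_by_mm_cpumask()
returns 0 if the architecture left the mask empty. The model says nothing
about the second reason (barrier semantics of the mask updates), which would
still need a per-architecture audit.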

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com