Re: [PATCH, RFC, tip/core/rcu] v3 scalable classic RCU implementation

From: Paul E. McKenney
Date: Sun Aug 31 2008 - 13:20:23 EST


On Sun, Aug 31, 2008 at 12:58:12PM +0200, Manfred Spraul wrote:
> Paul E. McKenney wrote:
>>
>>> Perhaps it's possible to rely on CPU_DYING, but I haven't figured out yet
>>> how to handle read-side critical sections in CPU_DYING handlers.
>>> Interrupts after CPU_DYING could be handled by rcu_irq_enter(),
>>> rcu_irq_exit() [yes, they exist on x86: the arch code enables the local
>>> interrupts in order to process the currently queued interrupts]
>>>
>>
>> My feeling is that CPU online/offline will be quite rare, so it should
>> be OK to clean up after the races in force_quiescent_state(), which in
>> this version is called every three ticks in a given grace period.
>>
> If cpu offline calls can fail, then the problem appears to be
> unsolvable:
> If I get it right, the offlining process looks like this:
> * one cpu in the system makes the CPU_DOWN_PREPARE notifier call. These
> calls can sleep (e.g. slab sleeps on semaphores). The cpu that goes offline
> is still alive, still doing arbitrary work. cpu_quiet calls on behalf of
> the cpu would be wrong.
> * stop_machine: all cpus schedule to a special kernel thread [1], only the
> dying cpu runs.
> * The cpu that goes offline calls the CPU_DYING notifiers.
> * __cpu_disable(): The cpu that goes offline checks whether it's possible
> to offline the cpu. At least on i386, this can fail.
> On success:
> * at least on i386: the cpu that goes offline handles outstanding
> interrupts. I'm not sure; perhaps even softirqs are handled.
> * the cpu stops handling interrupts.
> * stop_machine exits, and the remaining cpus continue their work.

As I understand it, this is the point where the dying CPU disables
interrupts and removes itself from the online masks. Though I would
feel better if there was an smp_mb() after the last local_irq_disable()
and before the remove_cpu_from_maps()!
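
Something like this is the ordering I have in mind (sketch only, not
the actual take_cpu_down() code; remove_cpu_from_maps() stands for
whatever the arch uses to clear its bit in cpu_online_map):

	local_irq_disable();		/* no further irq handlers run here */
	smp_mb();			/* prior RCU read-side critical */
					/*  sections visible to all CPUs */
	remove_cpu_from_maps(cpu);	/* only now vanish from */
					/*  cpu_online_map */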

> * The CPU_DEAD notifiers are called. They can sleep.
> On failure:
> * all cpus continue their work. call_rcu, synchronize_rcu(), ...
> * some time later: the CPU_DOWN_FAILED callbacks are called.
>
> Is that description correct?

Gautham?

> Then:
> - treating a cpu as always quiet after the rcu notifier was called with
> CPU_DOWN_PREPARE is wrong: the target cpu still runs normal code: user
> space, kernel space, interrupts, whatever. The target cpu still accepts
> interrupts, thus treating it as "normal" should work.

Indeed! My current code doesn't declare them offline until the CPU_DEAD
notifiers are called. And force_quiescent_state() does not consider
them to be offline until after they have cleared their bit in
cpu_online_map, which does not happen until the outgoing CPU has
disabled interrupts, at least on x86. So my current code should be
OK on x86.

It -looks- like stop_cpu() expects to be called with irqs disabled,
but I don't see what would be disabling irqs. (Don't kthreads normally
start with irqs enabled?) Ah, I see it -- the stop_cpu() threads
sequence through a state machine, and one of the states disables
irqs for everyone.
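
For reference, that state machine looks roughly like this (abbreviated
from my reading of kernel/stop_machine.c; comments mine):

	enum stopmachine_state {
		STOPMACHINE_NONE,	/* dummy starting state */
		STOPMACHINE_PREPARE,	/* wait for all threads to arrive */
		STOPMACHINE_DISABLE_IRQ,/* everyone does local_irq_disable() */
		STOPMACHINE_RUN,	/* the active CPU runs the function */
		STOPMACHINE_EXIT,	/* re-enable irqs and exit */
	};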

So the only problem would occur in architectures that re-enable irqs
in the middle of __cpu_disable(), as x86 does (but x86 correctly orders
the clearing of the cpu_online_map bit, so is OK). This of course
has the added benefit that irq handlers aren't running on a CPU that
is marked offline.

Checking other architectures:

o ARM arch/arm/kernel/smp.c __cpu_disable() does not re-enable
irqs, so is OK.

! arch/ia64/kernel/smpboot.c __cpu_disable() clears itself
from cpu_online_map before flushing pending irqs, which
might include RCU read-side critical sections. I believe that
the "cpu_clear(cpu, cpu_online_map)" must move to after the
"fixup_irqs()".

o arch/powerpc/kernel/smp.c __cpu_disable() does not disable
irqs directly, but calls subarch-specific functions noted
below.

o arch/powerpc/platforms/powermac/smp.c smp_core99_cpu_disable()
does not appear to re-enable irqs, so should be OK.

! arch/powerpc/kernel/smp.c generic_cpu_disable() clears itself
from cpu_online_map before invoking fixup_irqs(), which
momentarily enables irqs. I believe that the "cpu_clear(cpu,
cpu_online_map)" must move to after the "fixup_irqs()".

? arch/powerpc/platforms/pseries/hotplug-cpu.c pseries_cpu_disable()
clears itself from cpu_online_map before calling
xics_migrate_irqs_away(). This function rejects already-pending
irqs, then redirects future irqs. Not clear to me what happens
if an irq arrives between the reject and the immediately
following removal from the global interrupt queue.

o arch/s390/kernel/smp.c __cpu_disable() does not re-enable irqs,
so is OK.

! arch/sparc64/kernel/smp.c __cpu_disable() clears its bit before
re-enabling interrupts. I believe that the "cpu_clear(cpu,
cpu_online_map)" needs to happen after the local_irq_disable().

? include/asm-parisc/smp.h __cpu_disable() just returns without
doing anything. This means pa-risc does not support hotplug
CPU? If so, no problem.

I am sending (untested) patches separately for the amusement of the
arch maintainers.
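
Each of those fixes has the same shape, sketched below in generic form
(this is the intent, not a literal patch -- the code surrounding
__cpu_disable() differs from arch to arch):

	int __cpu_disable(void)
	{
		int cpu = smp_processor_id();

		/* ... arch-specific feasibility checks ... */

		fixup_irqs();	/* Flush pending irqs, which may contain
				 * RCU read-side critical sections, while
				 * this CPU is still visibly online. */
		cpu_clear(cpu, cpu_online_map);
				/* Only now disappear, so that
				 * force_quiescent_state() cannot ignore
				 * us while we might still be in an RCU
				 * read-side critical section. */
		return 0;
	}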

> __cpu_disable() success:
> - after CPU_DYING, a cpu is either in an interrupt or outside read-side
> critical sections. Parallel synchronize_rcu() calls are impossible until
> the cpu is dead. call_rcu() is probably possible.
> - The CPU_DEAD notifiers are called. a synchronize_rcu() call before the
> rcu notifier is called is possible.
> __cpu_disable() failure:
> - CPU_DYING is called, but the cpu remains fully alive. The system comes
> fully alive again.
> - some time later, CPU_DEAD is called.
>
> With the current CPU_DYING callback, it's impossible to be both
> deadlock-free and race-free under the given conditions. If __cpu_disable()
> succeeds, then the cpu must be treated as gone and always idle. If
> __cpu_disable() fails, then the cpu must be treated as fully there. Doing
> both things at the same time is impossible. Waiting until CPU_DOWN_FAILED
> or CPU_DEAD is called is impossible, too: Either synchronize_rcu() in a
> CPU_DEAD notifier [called before the rcu notifier] would deadlock or
> read-side critical sections on the not-killed cpu would race.

Assuming that the ordering of processing pending irqs and marking the
CPU offline in cpu_online_map can be resolved as noted above, it should
work fine -- if a CPU's bit is clear, we can safely ignore it. The race
can be resolved by checking the CPU's bit in force_quiescent_state().
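
In other words, force_quiescent_state()'s per-CPU scan would do
something like the following (a sketch: cpu_quiet() records a
quiescent state on the CPU's behalf, and cpus_needing_qs is a made-up
name for the mask of CPUs the current grace period still waits on):

	for_each_cpu_mask(cpu, cpus_needing_qs) {
		if (cpu_is_offline(cpu)) {
			cpu_quiet(cpu);	/* offline CPUs cannot be in an
					 * RCU read-side critical section */
			continue;
		}
		/* ... otherwise prod the CPU as usual ... */
	}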

Or am I missing something?

> What about moving the CPU_DYING notifier calls behind the __cpu_disable()
> call?
> Any other solutions?

RCU should ignore the CPU_DYING notifier calls -- only the CPU_DEAD.*
calls should be processed for CPUs being offlined. Right?
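
That is, RCU's hotplug notifier would look something like the
following (a sketch along the lines of the existing rcu_cpu_notify(),
with the CPU_UP_* cases elided):

	static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
					    unsigned long action, void *hcpu)
	{
		long cpu = (long)hcpu;

		switch (action) {
		case CPU_DEAD:
		case CPU_DEAD_FROZEN:
			rcu_offline_cpu(cpu);	/* only now treat it as gone */
			break;
		case CPU_DYING:		/* deliberately ignored */
		default:
			break;
		}
		return NOTIFY_OK;
	}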

> Btw, as far as I can see, rcupreempt would deadlock if a CPU_DEAD notifier
> uses synchronize_rcu().
> Probably no one will ever succeed in triggering the deadlock:
> - cpu goes offline.
> - the other cpus in the system are restarted.
> - one cpu does the CPU_DEAD notifier calls.
> - before the rcu notifier is called with CPU_DEAD:
> - one CPU_DEAD notifier sleeps.
> - while the CPU_DEAD notifier sleeps: on the same cpu, kmem_cache_destroy()
> is called. get_online_cpus() immediately succeeds.
> - kmem_cache_destroy acquires the cache_chain_mutex.
> - kmem_cache_destroy does synchronize_rcu(); it sleeps.
> - CPU_DEAD processing continues; the slab CPU_DEAD notifier tries to
> acquire the cache_chain_mutex. It sleeps, too.
> --> deadlock, because the already dead cpu will never signal itself as
> quiet. Thus synchronize_rcu() will never succeed, thus the slab CPU_DEAD
> notifier will never return, thus rcu_offline_cpu() is never called.

It is entirely possible that rcu_try_flip_waitack() and
rcu_try_flip_waitmb() need to check the AND of rcu_cpu_online_map and
cpu_online_map. If this really is a problem (and it might well be),
then the easiest fix is to check for cpu_is_offline(cpu) in both
rcu_try_flip_waitmb_needed() and rcu_try_flip_waitack_needed(), doing
so in both versions of both functions. Thoughts?
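
Concretely, something like this in kernel/rcupreempt.c (sketch; the
same one-line check would go into rcu_try_flip_waitmb_needed() and
into the dynticks variants of both functions):

	static inline int rcu_try_flip_waitack_needed(int cpu)
	{
		if (cpu_is_offline(cpu))
			return 0;	/* offline CPUs need not ack the flip */
		/* ... existing per-CPU flip-state test ... */
	}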

> --
> Manfred
> [1] open question: with rcu_preempt, is it possible that these cpus could
> be inside read-side critical sections?

Yes, they could. Well, tasks that were previously running on them
might be preempted or blocked waiting on locks while still in RCU
read-side critical sections. However, when a given CPU goes offline,
rcu_preempt moves that CPU's counters to some surviving CPU. So RCU
knows that these tasks are still in RCU read-side critical sections,
and will therefore wait for them.
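
For the record, the counter move is along these lines (heavily
abbreviated sketch, from memory, not the literal rcu_offline_cpu()
code in kernel/rcupreempt.c -- names and locking are approximate):

	void rcu_offline_cpu(int cpu)
	{
		/* Fold the dead CPU's two flip counters into those of a
		 * surviving CPU, so readers it was tracking remain
		 * accounted for until they call rcu_read_unlock(). */
		struct rcu_data *rdp = RCU_DATA_CPU(cpu);
		struct rcu_data *srdp = RCU_DATA_CPU(smp_processor_id());

		srdp->rcu_flipctr[0] += rdp->rcu_flipctr[0];
		rdp->rcu_flipctr[0] = 0;
		srdp->rcu_flipctr[1] += rdp->rcu_flipctr[1];
		rdp->rcu_flipctr[1] = 0;
	}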

Thanx, Paul