Re: [PATCH, RFC, tip/core/rcu] v3 scalable classic RCU implementation

From: Manfred Spraul
Date: Sun Aug 31 2008 - 06:58:29 EST


Paul E. McKenney wrote:

Perhaps it's possible to rely on CPU_DYING, but I haven't figured out yet how to handle read-side critical sections in CPU_DYING handlers.
Interrupts after CPU_DYING could be handled by rcu_irq_enter(), rcu_irq_exit() [yes, such interrupts exist on x86: the arch code enables local interrupts in order to process the currently queued interrupts].

My feeling is that CPU online/offline will be quite rare, so it should
be OK to clean up after the races in force_quiescent_state(), which in
this version is called every three ticks in a given grace period.
If you add failing cpu offline calls, then the problem appears to be unsolvable:
If I get it right, the offlining process looks like this:
* one cpu in the system makes the CPU_DOWN_PREPARE notifier call. These calls can sleep (e.g. slab sleeps on semaphores). The cpu that goes offline is still alive, still doing arbitrary work. cpu_quiet calls on behalf of the cpu would be wrong.
* stop_machine: all cpus schedule to a special kernel thread [1], only the dying cpu runs.
* The cpu that goes offline calls the CPU_DYING notifiers.
* __cpu_disable(): The cpu that goes offline checks whether it's possible to offline the cpu. At least on i386, this can fail.
On success:
* at least on i386: the cpu that goes offline handles outstanding interrupts. I'm not sure, perhaps even softirqs are handled.
* the cpu stops handling interrupts.
* stop_machine leaves, the remaining cpus continue their work.
* The CPU_DEAD notifiers are called. They can sleep.
On failure:
* all cpus continue their work. call_rcu, synchronize_rcu(), ...
* some time later: the CPU_DOWN_FAILED callbacks are called.
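
A rough user-space model of the sequence above, not kernel code. The step names mirror the notifier events in the list; `model_cpu_down()`, `enum hotplug_step` and `MAX_STEPS` are hypothetical names made up for this sketch. It only records the order of the steps, to make visible that CPU_DYING runs before the outcome of __cpu_disable() is known:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Steps of the offline path, in the order described in the list above. */
enum hotplug_step {
	STEP_DOWN_PREPARE,  /* CPU_DOWN_PREPARE notifiers, may sleep */
	STEP_STOP_MACHINE,  /* other cpus parked, only the dying cpu runs */
	STEP_DYING,         /* CPU_DYING notifiers on the dying cpu */
	STEP_CPU_DISABLE,   /* __cpu_disable(), can fail */
	STEP_DRAIN_IRQS,    /* handle outstanding interrupts, then stop */
	STEP_RESTART,       /* stop_machine leaves, other cpus resume */
	STEP_DEAD,          /* CPU_DEAD notifiers, may sleep */
	STEP_DOWN_FAILED,   /* CPU_DOWN_FAILED callbacks */
};

#define MAX_STEPS 7

/*
 * Hypothetical model: writes the step sequence into out[] and returns
 * the number of steps. disable_ok stands in for __cpu_disable() succeeding.
 */
size_t model_cpu_down(bool disable_ok, enum hotplug_step *out, size_t cap)
{
	size_t n = 0;

	assert(cap >= MAX_STEPS);
	out[n++] = STEP_DOWN_PREPARE;
	out[n++] = STEP_STOP_MACHINE;
	out[n++] = STEP_DYING;        /* outcome not known yet */
	out[n++] = STEP_CPU_DISABLE;
	if (disable_ok) {
		out[n++] = STEP_DRAIN_IRQS;
		out[n++] = STEP_RESTART;
		out[n++] = STEP_DEAD;
	} else {
		out[n++] = STEP_RESTART;
		out[n++] = STEP_DOWN_FAILED;
	}
	return n;
}
```

In both branches STEP_DYING sits at the same position, ahead of STEP_CPU_DISABLE, so a CPU_DYING callback cannot tell the two outcomes apart.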

Is that description correct?
Then:
- treating a cpu as always quiet after the rcu notifier was called with CPU_DOWN_PREPARE is wrong: the target cpu still runs normal code: user space, kernel space, interrupts, whatever. The target cpu still accepts interrupts, thus treating it as "normal" should work.
__cpu_disable() success:
- after CPU_DYING, a cpu is either in an interrupt or outside read-side critical sections. Parallel synchronize_rcu() calls are impossible until the cpu is dead. call_rcu() is probably possible.
- The CPU_DEAD notifiers are called. A synchronize_rcu() call can happen before the rcu notifier is called.
__cpu_disable() failure:
- CPU_DYING is called, but the cpu remains fully alive. The system comes fully alive again.
- some time later, CPU_DOWN_FAILED is called.

With the current CPU_DYING callback, it's impossible to be both deadlock-free and race-free with the given conditions. If __cpu_disable() succeeds, then the cpu must be treated as gone and always idle. If __cpu_disable() fails, then the cpu must be treated as fully there. Doing both things at the same time is impossible. Waiting until CPU_DOWN_FAILED or CPU_DEAD is called is impossible, too: Either synchronize_rcu() in a CPU_DEAD notifier [called before the rcu notifier] would deadlock or read-side critical sections on the not-killed cpu would race.

What about moving the CPU_DYING notifier calls behind the __cpu_disable() call?
Any other solutions?
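
To make the argument concrete, here is a small user-space sketch, not kernel code. All names (`enum rcu_view`, `required_view()`, `dying_before_disable_is_safe()`, `view_after_disable()`) are hypothetical. It also assumes that, with the reordering, the CPU_DYING notifiers would run only when __cpu_disable() has succeeded; that detail is an assumption of the sketch, not something the current code does:

```c
#include <assert.h>
#include <stdbool.h>

/* How rcu treats the target cpu. */
enum rcu_view {
	VIEW_GONE_IDLE,      /* cpu is gone, always counted as quiet */
	VIEW_FULLY_PRESENT,  /* cpu is a normal, running cpu */
};

/* The view that is correct for a given __cpu_disable() outcome. */
enum rcu_view required_view(bool disable_ok)
{
	return disable_ok ? VIEW_GONE_IDLE : VIEW_FULLY_PRESENT;
}

/*
 * Current order: CPU_DYING runs before __cpu_disable(), so the callback
 * must pick one view that is right for both outcomes.
 */
bool dying_before_disable_is_safe(enum rcu_view picked)
{
	assert(picked == VIEW_GONE_IDLE || picked == VIEW_FULLY_PRESENT);
	return picked == required_view(true) &&
	       picked == required_view(false);
}

/*
 * Proposed order (assumption: the notifier fires only after a successful
 * __cpu_disable()): the cpu stays "fully present" unless the notifier ran.
 */
enum rcu_view view_after_disable(bool disable_ok)
{
	enum rcu_view view = VIEW_FULLY_PRESENT;

	if (disable_ok)
		view = VIEW_GONE_IDLE;  /* CPU_DYING notifier ran */
	return view;
}
```

Neither view passes `dying_before_disable_is_safe()`, while `view_after_disable()` matches `required_view()` for both outcomes.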

Btw, as far as I can see, rcupreempt would deadlock if a CPU_DEAD notifier uses synchronize_rcu().
Probably no one will ever succeed in triggering the deadlock:
- cpu goes offline.
- the other cpus in the system are restarted.
- one cpu does the CPU_DEAD notifier calls.
- before the rcu notifier is called with CPU_DEAD:
- one CPU_DEAD notifier sleeps.
- while that CPU_DEAD notifier is sleeping: on the same cpu, kmem_cache_destroy is called. get_online_cpus immediately succeeds.
- kmem_cache_destroy acquires the cache_chain_mutex.
- kmem_cache_destroy does synchronize_rcu(), it sleeps.
- CPU_DEAD processing continues, the slab CPU_DEAD notifier tries to acquire the cache_chain_mutex. It sleeps, too.
--> deadlock, because the already dead cpu will never signal itself as quiet. Thus synchronize_rcu() will never succeed, thus the slab CPU_DEAD notifier will never return, thus rcu_offline_cpu() is never called.
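
The chain above is a cycle of waiters. A minimal user-space sketch of it as a wait-for table, not kernel code; the enum, the `waits_for` table and `blocked_on_itself()` are hypothetical names for this illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* The parties in the scenario above. */
enum waiter {
	W_NONE = -1,          /* not blocked on anything */
	W_SYNCHRONIZE_RCU,    /* needs the dead cpu reported as quiet */
	W_RCU_OFFLINE_CPU,    /* rcu_offline_cpu(), from the rcu CPU_DEAD notifier */
	W_SLAB_CPU_DEAD,      /* slab CPU_DEAD notifier, runs before the rcu one */
	W_CACHE_CHAIN_MUTEX,  /* held by kmem_cache_destroy */
	W_COUNT,
};

/* waits_for[x] is what x is blocked on. */
const int waits_for[W_COUNT] = {
	[W_SYNCHRONIZE_RCU]   = W_RCU_OFFLINE_CPU,
	[W_RCU_OFFLINE_CPU]   = W_SLAB_CPU_DEAD,
	[W_SLAB_CPU_DEAD]     = W_CACHE_CHAIN_MUTEX,
	[W_CACHE_CHAIN_MUTEX] = W_SYNCHRONIZE_RCU,  /* holder sleeps in synchronize_rcu() */
};

/* Follow the chain from start; true if it leads back to start. */
bool blocked_on_itself(const int *edges, int count, int start)
{
	int cur = start;

	for (int hops = 0; hops < count; hops++) {
		assert(cur >= 0 && cur < count);
		cur = edges[cur];
		if (cur == W_NONE)
			return false;  /* chain ends, someone can make progress */
		if (cur == start)
			return true;   /* came back around: deadlock */
	}
	return false;
}
```

Starting from any of the four entries the walk returns to its start. Clearing the W_SYNCHRONIZE_RCU entry (the dead cpu already counted as quiet) ends the chain.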

--
Manfred
[1] open question: with rcu_preempt, is it possible that these cpus could be inside read-side critical sections?