Re: regression 4.4: deadlock in with cgroup percpu_rwsem

From: Peter Zijlstra
Date: Wed Jan 20 2016 - 05:48:11 EST


On Wed, Jan 20, 2016 at 11:30:36AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 20, 2016 at 11:15:05AM +0100, Christian Borntraeger wrote:
> > [ 561.044066] Krnl PSW : 0704e00180000000 00000000001aa1ee (remove_entity_load_avg+0x1e/0x1b8)
>
> > [ 561.044176] ([<00000000001ad750>] free_fair_sched_group+0x80/0xf8)
> > [ 561.044181] [<0000000000192656>] free_sched_group+0x2e/0x58
> > [ 561.044187] [<00000000001ded82>] rcu_process_callbacks+0x3fa/0x928
>
> Urgh,.. lemme stare at that.

TJ, is css_offline guaranteed to be called in hierarchical order? I
got properly lost in the whole cgroup destroy code. There are endless
workqueues and RCU callbacks in there.

So the current place in free_fair_sched_group() is far too late to be
calling remove_entity_load_avg(). But I'm not sure where I should put
it; it needs to be in a place where we know the group is going to die
but its parent is guaranteed to still exist.

Would offline be that place?
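
To make the ordering constraint concrete, here is a minimal userspace
sketch. It is not kernel code: struct toy_group and the toy_group_*()
helpers are hypothetical stand-ins, and it simply assumes an offline-style
hook that runs while the parent is still alive. It shows the child's load
contribution being removed from the parent at offline time, so that the
late free step (the analogue of the RCU callback in the trace above) never
has to touch the parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a task group; NOT the kernel's struct task_group. */
struct toy_group {
	struct toy_group *parent;
	long load_avg;		/* this group's contribution to its parent */
	long children_load;	/* sum of contributions from online children */
	bool online;
	bool freed;
};

static void toy_group_init(struct toy_group *g, struct toy_group *parent,
			   long load)
{
	g->parent = parent;
	g->load_avg = load;
	g->children_load = 0;
	g->online = true;
	g->freed = false;
	if (parent)
		parent->children_load += load;
}

/*
 * Hypothetical offline hook: the group is known to be dying, but the
 * parent is assumed to still exist, so this is where the contribution
 * is removed.
 */
static void toy_group_offline(struct toy_group *g)
{
	if (g->parent) {
		assert(!g->parent->freed);
		g->parent->children_load -= g->load_avg;
	}
	g->load_avg = 0;
	g->online = false;
}

/*
 * Hypothetical late free (think: deferred callback). By now the parent
 * may already be gone, so this step only tears down the group itself.
 */
static void toy_group_free(struct toy_group *g)
{
	assert(!g->online);
	g->parent = NULL;
	g->freed = true;
}
```

With this split, the order in which the deferred frees run no longer
matters: a parent can be freed before its child without the child's
teardown dereferencing it. Whether css_offline actually gives the
"parent still exists" guarantee is exactly the open question above.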