Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

From: Mathias Krause
Date: Thu Nov 04 2021 - 11:13:27 EST


On 04.11.21 at 09:50, Vincent Guittot wrote:
> On Wed, 3 Nov 2021 at 23:04, Benjamin Segall <bsegall@xxxxxxxxxx> wrote:
>>
>> Mathias Krause <minipli@xxxxxxxxxxxxxx> writes:
>>
>>> Kevin is reporting crashes which point to a use-after-free of a cfs_rq
>>> in update_blocked_averages(). Initial debugging revealed that we have live
>>> cfs_rq's (on_list=1) in an about-to-be-kfree()'d task group in
>>> free_fair_sched_group(). However, it was unclear how that could happen.
>>> [...]
>>> Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
>>> Cc: Odin Ugedal <odin@xxxxxxx>
>>> Cc: Michal Koutný <mkoutny@xxxxxxxx>
>>> Reported-by: Kevin Tanguy <kevin.tanguy@xxxxxxxxxxxx>
>>> Suggested-by: Brad Spengler <spender@xxxxxxxxxxxxxx>
>>> Signed-off-by: Mathias Krause <minipli@xxxxxxxxxxxxxx>
>>> ---
>>> kernel/sched/core.c | 18 +++++++++++++++---
>>> 1 file changed, 15 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 978460f891a1..60125a6c9d1b 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -9506,13 +9506,25 @@ void sched_offline_group(struct task_group *tg)
>>> {
>>> unsigned long flags;
>>>
>>> - /* End participation in shares distribution: */
>>> - unregister_fair_sched_group(tg);
>>> -
>>> + /*
>>> + * Unlink first, to avoid walk_tg_tree_from() from finding us (via
>>> + * sched_cfs_period_timer()).
>>> + */
>>> spin_lock_irqsave(&task_group_lock, flags);
>>> list_del_rcu(&tg->list);
>>> list_del_rcu(&tg->siblings);
>>> spin_unlock_irqrestore(&task_group_lock, flags);
>>> +
>>> + /*
>>> + * Wait for all pending users of this task group to leave their RCU
>>> + * critical section to ensure no new user will see our dying task
>>> + * group any more. Specifically ensure that tg_unthrottle_up() won't
>>> + * add decayed cfs_rq's to it.
>>> + */
>>> + synchronize_rcu();
>>
>> I was going to suggest that we could just clear all of avg.load_sum/etc, but
>> that breaks the speculative on_list read. Currently the final avg update
>> just races, but that's not good enough if we wanted to rely on it to
>> prevent UAF. synchronize_rcu() doesn't look so bad if the alternative is
>> taking every rqlock anyways.
>>
>> I do wonder if we can move the relevant part of
>> unregister_fair_sched_group into sched_free_group_rcu. After all
>> for_each_leaf_cfs_rq_safe is not _rcu and update_blocked_averages does
>> in fact hold the rqlock (though print_cfs_stats thinks it is _rcu and
>> should be updated).
>
> I was wondering the same thing.
> we would have to move unregister_fair_sched_group() completely in
> sched_free_group_rcu() and probably in cpu_cgroup_css_free() too.

Well, the point is, print_cfs_stats() pretty much relies on the list
being stable, i.e. safe to traverse. It doesn't take any locks while
walking it (besides the RCU read lock), so if we modified it
concurrently, bad things would happen.
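
For reference, that walk looks roughly like this (quoting from memory,
so treat it as a sketch rather than the exact code in
kernel/sched/fair.c):

void print_cfs_stats(struct seq_file *m, int cpu)
{
        struct cfs_rq *cfs_rq, *pos;

        /* Only the RCU read lock protects the leaf list traversal. */
        rcu_read_lock();
        for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
                print_cfs_rq(m, cpu, cfs_rq);
        rcu_read_unlock();
}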

Now, all manipulations of &cfs_rq->leaf_cfs_rq_list happen via the
list_*_rcu() variants, so that assumption in print_cfs_stats() is
actually sound -- the list chain can be walked without fear of
dereferencing bad pointers. But if we moved the unchaining of cfs_rq's
down to sched_free_group_rcu() (or free_fair_sched_group(), as that
gets called more or less directly from sched_free_group_rcu()), we
would have the unchaining and kfree() of the cfs_rq's within the same
RCU grace period. Or, ignoring RCU for a moment, print_cfs_stats() may
see a cfs_rq on one CPU that we're about to kfree() on another. So, at
the very least, the kfree(tg->cfs_rq[i]) would need to become a
kfree_rcu(). But likely all the other frees as well, as
print_cfs_stats() looks at cfs_rq->tg and cfs_rq->tg->se[cpu], too.
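
IOW, something like the below (very hand-wavy sketch, not a patch: the
'rcu' rcu_head member I'm feeding to kfree_rcu() doesn't exist in
struct cfs_rq / struct sched_entity, it would have to be added first):

void free_fair_sched_group(struct task_group *tg)
{
        int i;

        destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));

        for_each_possible_cpu(i) {
                /* defer the frees by a GP; 'rcu' is a to-be-added rcu_head */
                if (tg->cfs_rq)
                        kfree_rcu(tg->cfs_rq[i], rcu);
                if (tg->se)
                        kfree_rcu(tg->se[i], rcu);
        }

        /* the pointer arrays (and tg itself) would need the same treatment */
        kfree(tg->cfs_rq);
        kfree(tg->se);
}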

> remove_entity_load_avg(tg->se[cpu]); has to be called only once we are
> sure that se->my_q will not be updated which means no more in the
> leaf_list

And that would only be after the tg was unlinked in
sched_offline_group(), so walk_tg_tree_from() can't find it any more
(to prevent re-linking the cfs_rq), and after
unregister_fair_sched_group() has run. But between the two an RCU
grace period is needed. Sure, we can use the existing one by moving
unregister_fair_sched_group() to sched_free_group_rcu() /
cpu_cgroup_css_free(). But then we also need another RCU GP after
that, prior to releasing the involved objects, to ensure
print_cfs_stats() won't become the new use-after-free victim.
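
To spell out the ordering I have in mind (rough sketch only, reusing
the existing css_tg() / sched_free_group_rcu() / tg->rcu bits; this is
not what cpu_cgroup_css_free() looks like today):

static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
{
        struct task_group *tg = css_tg(css);

        /*
         * An RCU GP has passed since sched_offline_group() unlinked tg,
         * so walk_tg_tree_from() can't re-link its cfs_rq's any more.
         */
        unregister_fair_sched_group(tg);

        /*
         * Defer the actual freeing by (at least) one more GP so a
         * concurrent print_cfs_stats() can still safely dereference
         * the cfs_rq's, cfs_rq->tg and tg->se[cpu].
         */
        call_rcu(&tg->rcu, sched_free_group_rcu);
}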

Thanks,
Mathias