Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's

From: Benjamin Segall
Date: Wed Nov 03 2021 - 18:04:12 EST


Mathias Krause <minipli@xxxxxxxxxxxxxx> writes:

> Kevin is reporting crashes which point to a use-after-free of a cfs_rq
> in update_blocked_averages(). Initial debugging revealed that we've live
> cfs_rq's (on_list=1) in an about to be kfree()'d task group in
> free_fair_sched_group(). However, it was unclear how that can happen.
> [...]
> Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
> Cc: Odin Ugedal <odin@xxxxxxx>
> Cc: Michal Koutný <mkoutny@xxxxxxxx>
> Reported-by: Kevin Tanguy <kevin.tanguy@xxxxxxxxxxxx>
> Suggested-by: Brad Spengler <spender@xxxxxxxxxxxxxx>
> Signed-off-by: Mathias Krause <minipli@xxxxxxxxxxxxxx>
> ---
> kernel/sched/core.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 978460f891a1..60125a6c9d1b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9506,13 +9506,25 @@ void sched_offline_group(struct task_group *tg)
> {
> unsigned long flags;
>
> - /* End participation in shares distribution: */
> - unregister_fair_sched_group(tg);
> -
> + /*
> + * Unlink first, to avoid walk_tg_tree_from() from finding us (via
> + * sched_cfs_period_timer()).
> + */
> spin_lock_irqsave(&task_group_lock, flags);
> list_del_rcu(&tg->list);
> list_del_rcu(&tg->siblings);
> spin_unlock_irqrestore(&task_group_lock, flags);
> +
> + /*
> + * Wait for all pending users of this task group to leave their RCU
> + * critical section to ensure no new user will see our dying task
> + * group any more. Specifically ensure that tg_unthrottle_up() won't
> + * add decayed cfs_rq's to it.
> + */
> + synchronize_rcu();

I was going to suggest that we could just clear all of avg.load_sum/etc, but
that breaks the speculative on_list read. Currently the final avg update
just races, which is not good enough if we want to rely on it to
prevent the UAF. synchronize_rcu() doesn't look so bad if the alternative is
taking every rqlock anyway.
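
To spell out the ordering argument, here is a toy userspace model, not
kernel code. Every toy_* name is made up for illustration, the "walker"
stands in for the walk_tg_tree_from()/tg_unthrottle_up() path, and the
grace period is modelled simply as "walkers that already hold the pointer
finish before we continue". It is a sketch under those assumptions only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the kernel's structures. */
struct toy_cfs_rq { bool on_list; };

struct toy_tg {
	bool linked;			/* reachable via the task group list */
	bool freed;
	struct toy_cfs_rq cfs_rq;
};

/* A walker can only find the group while it is still linked. */
static struct toy_tg *toy_walker_lookup(struct toy_tg *tg)
{
	return tg->linked ? tg : NULL;
}

/* Models the unthrottle path putting the cfs_rq back on the leaf list. */
static void toy_walker_readd(struct toy_tg *found)
{
	if (found)
		found->cfs_rq.on_list = true;
}

static void toy_unlink(struct toy_tg *tg) { tg->linked = false; }
static void toy_unregister(struct toy_tg *tg) { tg->cfs_rq.on_list = false; }

/* Returns true if the group is freed while its cfs_rq is still listed. */
static bool toy_free(struct toy_tg *tg)
{
	assert(!tg->freed);
	tg->freed = true;
	return tg->cfs_rq.on_list;
}

/* Old order: unregister first, so a walker can re-add before the unlink. */
bool toy_old_order_frees_listed_cfs_rq(void)
{
	struct toy_tg tg = { .linked = true, .cfs_rq = { .on_list = true } };
	struct toy_tg *found;

	toy_unregister(&tg);
	found = toy_walker_lookup(&tg);	/* group is still linked */
	toy_walker_readd(found);	/* cfs_rq is back on the list */
	toy_unlink(&tg);
	return toy_free(&tg);
}

/* Patched order: unlink, let in-flight walkers finish, then unregister. */
bool toy_new_order_frees_listed_cfs_rq(void)
{
	struct toy_tg tg = { .linked = true, .cfs_rq = { .on_list = true } };
	struct toy_tg *in_flight, *late;

	in_flight = toy_walker_lookup(&tg);	/* started before the unlink */
	toy_unlink(&tg);
	toy_walker_readd(in_flight);	/* completes within the "grace period" */
	late = toy_walker_lookup(&tg);	/* NULL: group no longer reachable */
	toy_walker_readd(late);		/* no-op */
	toy_unregister(&tg);		/* clears whatever was re-added */
	return toy_free(&tg);
}
```

In this model the old order ends up freeing the group with the cfs_rq
still listed, while the patched order does not, because nothing can find
the group once the walkers that were already in flight have finished.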

I do wonder if we can move the relevant part of
unregister_fair_sched_group() into sched_free_group_rcu(). After all,
for_each_leaf_cfs_rq_safe() is not _rcu, and update_blocked_averages() does
in fact hold the rqlock (though print_cfs_stats() treats the list as _rcu and
should be updated).


> +
> + /* End participation in shares distribution: */
> + unregister_fair_sched_group(tg);
> }
>
> static void sched_change_group(struct task_struct *tsk, int type)