Re: task_group unthrottling and removal race (was Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence)

From: Mathias Krause
Date: Wed Nov 03 2021 - 10:16:39 EST


On 03.11.21 at 12:10, Michal Koutný wrote:
> On Wed, Nov 03, 2021 at 11:51:12AM +0100, Mathias Krause <minipli@xxxxxxxxxxxxxx> wrote:
>> Adding synchronize_rcu() here will ensure all concurrent RCU "readers"
>> will have finished what they're doing, so we can unlink safely. That
>> was, apparently, the missing piece.
>
> What reader(s) are you referring to here? The
> list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
> throttled_list) in distribute_cfs_runtime()?

Let me quote a little bit more context here to explain my rationale:

>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 978460f891a1..afee07e9faf9 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -9506,13 +9506,17 @@ void sched_offline_group(struct task_group *tg)
>> {
>> unsigned long flags;
>>
>> - /* End participation in shares distribution: */
>> - unregister_fair_sched_group(tg);
>> -
>> + /*
>> + * Unlink first, to avoid walk_tg_tree_from() from finding us
>> + * (via sched_cfs_period_timer()).
>> + */

>> spin_lock_irqsave(&task_group_lock, flags);
>> list_del_rcu(&tg->list);
>> list_del_rcu(&tg->siblings);
>> spin_unlock_irqrestore(&task_group_lock, flags);

The above list_del_rcu() implies there are users that use RCU semantics
to walk the lists, namely tg->list and tg->siblings. And, in fact,
walk_tg_tree_from() does just that:

[...]
list_for_each_entry_rcu(child, &parent->children, siblings) {
[...]
}

Now, taking the task_group_lock in sched_offline_group() just prevents
multiple users from entering this code path. But walk_tg_tree_from()
does not take that lock, so it can walk the task group hierarchy
concurrently.

Disabling IRQs also doesn't protect us from do_sched_cfs_period_timer()
running in parallel, as the timer interrupt might fire on a different
CPU than the one we're currently running on.
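
To spell out the reader side (condensed from the race diagram at the
end of this mail and the throttled_list iteration you mentioned, so a
sketch rather than the literal code), the timer path boils down to:

  do_sched_cfs_period_timer()
    distribute_cfs_runtime()
      rcu_read_lock();
      list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq, throttled_list)
        unthrottle_cfs_rq(cfs_rq)
          walk_tg_tree_from(cfs_rq->tg, …, tg_unthrottle_up, …)
            list_for_each_entry_rcu(child, &parent->children, siblings)
              tg_unthrottle_up()   /* may call list_add_leaf_cfs_rq() */
      rcu_read_unlock();

Nothing in this path takes task_group_lock; with respect to the task
group lists, the walk relies on RCU only.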

So, assuming the scenario in race.txt (sorry, my mailer is just too dumb
for bigger ASCII art; see the end of this mail), CPU2 will walk the task
group list in parallel to sched_offline_group(); specifically, it'll
read the soon-to-be-unlinked task group entry in (1). So removing it on
CPU1 in (2) won't prevent CPU2 from still passing it on to
tg_unthrottle_up(). When unregister_fair_sched_group() on CPU1 now tries
to unlink all cfs_rq's via calls to list_del_leaf_cfs_rq(), CPU2 will
re-add some of them in (3), which is the cause of the UAF later on. Now,
if we added a synchronize_rcu() prior to calling
unregister_fair_sched_group(), we would wait until the walk on CPU2 has
finished at (4). Afterwards, no new walk will see the dying task group
any more, thereby preventing the UAF from happening. Or am I missing
something?

>> +
>> + /* End participation in shares distribution: */
>
> Adding synchronize_rcu() here will ensure all concurrent RCU "readers"
> will have finished what they're doing, so we can unlink safely. That
> was, apparently, the missing piece.
>
>> + unregister_fair_sched_group(tg);
>> }
>>
>> static void sched_change_group(struct task_struct *tsk, int type)
>>
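
To put the whole thing together: the ordering I'm after in
sched_offline_group() would be roughly the following (just a sketch
combining the quoted hunks with the synchronize_rcu() discussed above,
not a final patch):

    void sched_offline_group(struct task_group *tg)
    {
        unsigned long flags;

        /*
         * Unlink first, so new walks started via
         * sched_cfs_period_timer() can no longer find us.
         */
        spin_lock_irqsave(&task_group_lock, flags);
        list_del_rcu(&tg->list);
        list_del_rcu(&tg->siblings);
        spin_unlock_irqrestore(&task_group_lock, flags);

        /*
         * Wait for walks that started before the unlinking, like the
         * one on CPU2 reaching (1) in race.txt, to drop their
         * rcu_read_lock(), so they can no longer re-add our cfs_rq's
         * via tg_unthrottle_up().
         */
        synchronize_rcu();

        /* End participation in shares distribution: */
        unregister_fair_sched_group(tg);
    }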

Btw, the above patch (plus the warning debug patch) is running your
reproducer here for 3.5h without triggering any warnings. It used to
trigger them within the first few minutes.

> I think (not sure) that wouldn't work since the unthrottle_cfs_rq can
> still be called after this synchronize_rcu() but before
> free_fair_sched_group().

Sure, but it won't see the dying task group any more. So it won't re-add
cfs_rq's to it, preventing the UAF from happening.

> (But if you considered update_blocked_averages() as the reader(s) and
> synchronize_rcu() within free_fair_sched_group(), that may ensure UBA
> won't step on a free'd cfs_rq (+unlinking would need to happen only in
> free_fair_sched_group() too.))

I'd prefer not linking cfs_rq's to a dead tg. Trying to cover up just
before we're about to free the memory feels hacky.

Thanks,
Mathias

    CPU1:                                         CPU2:
      :                                           timer IRQ:
      :                                             do_sched_cfs_period_timer():
      :                                               :
      :                                               distribute_cfs_runtime():
      :                                                 rcu_read_lock();
      :                                                 :
      :                                                 unthrottle_cfs_rq():
    sched_offline_group():                                :
      :                                                   walk_tg_tree_from(…,tg_unthrottle_up,…):
      list_del_rcu(&tg->list);                              :
(1)   :                                                     list_for_each_entry_rcu(child, &parent->children, siblings)
      :                                                       :
(2)   list_del_rcu(&tg->siblings);                            :
      :                                                       tg_unthrottle_up():
      unregister_fair_sched_group():                            struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
        :                                                       :
        list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);                  :
        :                                                       :
        :                                                       if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
(3)     :                                                         list_add_leaf_cfs_rq(cfs_rq);
        :                                                       :
(4)     :                                                 rcu_read_unlock();