Re: Crash in list_add_leaf_cfs_rq due to bad tmp_alone_branch

From: Vincent Guittot
Date: Mon Jan 21 2019 - 09:46:35 EST


Hi Sargun,

On Friday 18 Jan 2019 at 15:06:28 (+0100), Vincent Guittot wrote:
> On Fri, 18 Jan 2019 at 11:16, Vincent Guittot
> <vincent.guittot@xxxxxxxxxx> wrote:
> >
> > On Wed, 9 Jan 2019 at 23:43, Sargun Dhillon <sargun@xxxxxxxxx> wrote:
> > >
> > > On Wed, Jan 9, 2019 at 2:14 PM Sargun Dhillon <sargun@xxxxxxxxx> wrote:
> > > >
> > > > I picked up c40f7d74c741a907cfaeb73a7697081881c497d0 sched/fair: Fix
> > > > infinite loop in update_blocked_averages() by reverting a9e7f6544b9c
> > > > and put it on top of 4.19.13. In addition to this, I uninlined
> > > > list_add_leaf_cfs_rq for debugging.
>
> With the fix above applied, the code that manages the leaf_cfs_rq_list
> has been the same since v4.9.
> Have you noticed a similar problem on any older kernel version between
> v4.9 and v4.19? The problem might have been introduced while modifying
> another part of the scheduler, like the sequence for adding/removing
> cgroups.
>
> Knowing the most recent kernel version without the problem could help
> to narrow down the problem.
>
> Thanks,
> Vincent
>
> > > >
> > > > This revealed a new bug that we didn't get to because we kept getting
> > > > crashes from the previous issue. When we are running with cgroups that
> > > > are rapidly changing, with CFS bandwidth control, and in addition
> > > > using the cpusets cgroup, we see this crash. Specifically, it seems to
> > > > occur with cgroups that are throttled and we change the allowed
> > > > cpuset.
> >
> > Thanks for the context, I will try to reproduce the problem and
> > understand how we can stop in the middle of walking the
> > sched_entity branch with a parent not already added.
> >
> > How many cgroup levels have you got in your setup?
> >
> > > >
> > >
> > > This patch from Gabriel should fix the problem:
> > >
> > >
> > > [PATCH] sched/fair: Reset tmp_alone_branch on cfs_rq delete
> > >
> > > When a child cfs_rq is added to the leaf cfs_rq list before its parent,
> > > tmp_alone_branch is set to point to the child in preparation for the
> > > parent being added.
> > >
> > > If the child is deleted before the parent is added then tmp_alone_branch
> > > points to a freed cfs_rq. Any future reference to tmp_alone_branch will
> > > result in a use after free.
> >
> > So, the patch below is a temporary fix that helps to recover from the
> > situation where tmp_alone_branch doesn't end up back at
> > rq->leaf_cfs_rq_list.
> > But this situation should not happen in the first place.

I have been able to reproduce the situation where tmp_alone_branch doesn't
point to rq->leaf_cfs_rq_list after enqueuing a task.

Can you try the patch below, which ensures that all cfs_rq of a cgroup branch
are added to the list even if throttled?

The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that
it will walk down to root the 1st time a cfs_rq is used and that we will
finish by adding either a cfs_rq without a parent or a cfs_rq with a parent
that is already on the list. But this is not always true in the presence of
throttling. Because a cfs_rq can be throttled even if it has never been used,
when other CPUs of the cgroup have already used all the bandwidth, we are not
guaranteed to go down to the root and add all cfs_rq to the list.

Ensure that all cfs_rq will be added to the list even if they are throttled.

Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
---
kernel/sched/fair.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6483834..ae468ab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -352,6 +352,20 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 	}
 }

+static inline void list_add_branch_cfs_rq(struct sched_entity *se, struct rq *rq)
+{
+	struct cfs_rq *cfs_rq;
+
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+		list_add_leaf_cfs_rq(cfs_rq);
+
+		/* If parent is already in the list, we can stop */
+		if (rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
+			break;
+	}
+}
+
 /* Iterate through all leaf cfs_rq's on a runqueue: */
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
@@ -5177,6 +5191,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

 	}

+	/* Ensure that all cfs_rq have been added to the list */
+	list_add_branch_cfs_rq(se, rq);
+
 	hrtick_update(rq);
 }
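
For anyone who wants to see the ordering problem outside the kernel, here is
a minimal userspace sketch of the scenario described in the commit message
above. It is only a model under simplifying assumptions: no RCU, no per-CPU
data, and every model_* name and the node helper are hypothetical, not kernel
API. It shows an enqueue walk that stops at a throttled level and leaves
tmp_alone_branch pointing into the middle of the list, and how finishing the
branch walk brings it back to the list head.

```c
/* Userspace model only: every model_* name is hypothetical, not kernel API. */
#include <assert.h>
#include <stddef.h>

struct node { struct node *prev, *next; };

/* Link n between two adjacent nodes of a circular doubly linked list. */
static void node_insert(struct node *n, struct node *prev, struct node *next)
{
	next->prev = n;
	n->next = next;
	n->prev = prev;
	prev->next = n;
}

struct model_rq {
	struct node leaf_list;          /* stands in for rq->leaf_cfs_rq_list */
	struct node *tmp_alone_branch;  /* where a branch without listed parent grows */
};

struct model_cfs_rq {
	struct node link;
	struct model_cfs_rq *parent;    /* NULL for the root */
	struct model_rq *rq;
	int on_list;
	int throttled;
};

/* Three cases: parent listed, root, or parent not listed yet. */
static void model_list_add_leaf(struct model_cfs_rq *c)
{
	struct model_rq *rq = c->rq;

	if (c->on_list)
		return;
	if (c->parent && c->parent->on_list) {
		/* Parent already listed: child goes right before it. */
		node_insert(&c->link, c->parent->link.prev, &c->parent->link);
		rq->tmp_alone_branch = &rq->leaf_list;
	} else if (!c->parent) {
		/* Root: goes to the tail, which closes the pending branch. */
		node_insert(&c->link, rq->leaf_list.prev, &rq->leaf_list);
		rq->tmp_alone_branch = &rq->leaf_list;
	} else {
		/* Parent not listed yet: extend the pending branch. */
		node_insert(&c->link, rq->tmp_alone_branch,
			    rq->tmp_alone_branch->next);
		rq->tmp_alone_branch = &c->link;
	}
	c->on_list = 1;
}

/* Enqueue walk that stops at a throttled level; ancestors stay unlisted. */
static struct model_cfs_rq *model_enqueue_walk(struct model_cfs_rq *c)
{
	for (; c; c = c->parent) {
		model_list_add_leaf(c);
		if (c->throttled)
			break;
	}
	return c;
}

/* Model of the proposed branch walk: continue until the branch is closed. */
static void model_add_branch(struct model_cfs_rq *c, struct model_rq *rq)
{
	for (; c; c = c->parent) {
		model_list_add_leaf(c);
		if (rq->tmp_alone_branch == &rq->leaf_list)
			break;
	}
}

/* Returns 1 if tmp_alone_branch is back at the list head after enqueue. */
int model_demo(int apply_fix)
{
	struct model_rq rq;
	struct model_cfs_rq root = { .parent = NULL, .rq = &rq };
	struct model_cfs_rq mid = { .parent = &root, .rq = &rq, .throttled = 1 };
	struct model_cfs_rq leaf = { .parent = &mid, .rq = &rq };
	struct model_cfs_rq *stopped;

	rq.leaf_list.prev = rq.leaf_list.next = &rq.leaf_list;
	rq.tmp_alone_branch = &rq.leaf_list;

	stopped = model_enqueue_walk(&leaf);
	assert(leaf.on_list && mid.on_list && !root.on_list);
	if (apply_fix)
		model_add_branch(stopped, &rq);
	return rq.tmp_alone_branch == &rq.leaf_list;
}
```

In this model, model_demo(0) leaves tmp_alone_branch pointing at the throttled
middle level, while model_demo(1) ends with the list ordered leaf, mid, root
and tmp_alone_branch back at the head. The three-level hierarchy is just an
example; the real number of levels depends on the setup.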



> >
> >
> > >
> > > Signed-off-by: Gabriel Hartmann <gabriel.hartmann@xxxxxxxxx>
> > > Reported-by: Sargun Dhillon <sargun@xxxxxxxxx>
> > > ---
> > > kernel/sched/fair.c | 5 +++++
> > > 1 file changed, 5 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 7137bc343b4a..0987629cbb76 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -347,6 +347,11 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> > >  static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> > >  {
> > >  	if (cfs_rq->on_list) {
> > > +		struct rq *rq = rq_of(cfs_rq);
> > > +
> > > +		if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
> > > +			rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
> > > +
> > >  		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
> > >  		cfs_rq->on_list = 0;
> > >  	}
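
The stale-pointer case that Gabriel's patch guards against can be sketched in
userspace along the same lines. Again this is only a model under simplifying
assumptions: nothing is actually freed, there is no RCU, and the model_* names
and the reset_on_delete switch are hypothetical, introduced purely for
illustration. It shows that deleting a child that was added before its parent
leaves tmp_alone_branch referring to the deleted entry unless the delete path
resets it.

```c
/* Userspace model only: every model_* name is hypothetical, not kernel API. */
#include <assert.h>

struct node { struct node *prev, *next; };

struct model_rq {
	struct node leaf_list;          /* stands in for rq->leaf_cfs_rq_list */
	struct node *tmp_alone_branch;
};

struct model_cfs_rq {
	struct node link;
	struct model_rq *rq;
	int on_list;
};

/* Child added while its parent is not on the list yet. */
static void model_add_alone(struct model_cfs_rq *c)
{
	struct model_rq *rq = c->rq;
	struct node *pos = rq->tmp_alone_branch;

	c->link.next = pos->next;
	c->link.prev = pos;
	pos->next->prev = &c->link;
	pos->next = &c->link;
	rq->tmp_alone_branch = &c->link;
	c->on_list = 1;
}

/* Delete path; reset_on_delete models the check added by the quoted patch. */
static void model_del_leaf(struct model_cfs_rq *c, int reset_on_delete)
{
	struct model_rq *rq = c->rq;

	if (!c->on_list)
		return;
	if (reset_on_delete && rq->tmp_alone_branch == &c->link)
		rq->tmp_alone_branch = &rq->leaf_list;
	c->link.prev->next = c->link.next;
	c->link.next->prev = c->link.prev;
	c->on_list = 0;
}

/* Returns 1 if tmp_alone_branch still refers to the deleted child. */
int model_tmp_is_stale(int reset_on_delete)
{
	struct model_rq rq;
	struct model_cfs_rq child = { .rq = &rq };

	rq.leaf_list.prev = rq.leaf_list.next = &rq.leaf_list;
	rq.tmp_alone_branch = &rq.leaf_list;

	model_add_alone(&child);
	assert(rq.tmp_alone_branch == &child.link);
	model_del_leaf(&child, reset_on_delete);
	return rq.tmp_alone_branch == &child.link;
}
```

In the kernel the deleted cfs_rq would eventually be freed, so the stale
pointer reported by model_tmp_is_stale(0) corresponds to the use after free
described in the patch description.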