Re: Null pointer crash at find_idlest_group on db845c w/ linus/master

From: Valentin Schneider
Date: Tue Dec 03 2019 - 19:13:56 EST


On 03/12/2019 23:49, Valentin Schneider wrote:
> On 03/12/2019 23:20, Valentin Schneider wrote:
>> Looking at the code, I think I got it. In find_idlest_group() we do
>> initialize 'idlest_sgs' (just like busiest_stat in LB) but 'idlest' is just
>> NULL. The latter is dereferenced in update_pick_idlest() just for the misfit
>> case, which goes boom. And I reviewed the damn thing... Bleh.
>>
>> Fixup looks easy enough, lemme write one up.
>>
>
> Wait no, that can't be right. We can only get in there if both 'group' and
> 'idlest' have the same group_type, which can't be true on the first pass.
> So if we go through the misfit stuff, idlest *has* to be set to something.
> Bah.
>

So I think the thing really is dying on a sched_group->sgc deref (pahole says
sgc is at offset 16), which means we have a NULL sched_group somewhere, but
I don't see how. It can only be 'local' (can't be: it's the first group we
visit and it doesn't go through update_pick_idlest()) or 'idlest' (see
previous email).
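
To make the offset argument concrete, here is a rough user-space sketch. The
mock_* structs and sgc_fault_offset() are made-up stand-ins, not the kernel
definitions; the field order is what I recall of struct sched_group and
should be checked against kernel/sched/sched.h. On a 64-bit build, reading
->sgc through a NULL group pointer would touch address 16:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for struct sched_group_capacity. */
struct mock_sgc {
	unsigned long max_capacity;
};

/*
 * Hypothetical stand-in for struct sched_group. The field order is an
 * assumption; 'int ref' stands in for the kernel's atomic_t.
 */
struct mock_sched_group {
	struct mock_sched_group *next;	/* offset 0 on LP64 */
	int ref;			/* offset 8 */
	unsigned int group_weight;	/* offset 12 */
	struct mock_sgc *sgc;		/* offset 16 on LP64 */
};

/* Address that a load of group->sgc would touch if group were NULL. */
static size_t sgc_fault_offset(void)
{
	return offsetof(struct mock_sched_group, sgc);
}

/* Print the offset so it can be compared with the faulting address. */
static void print_sgc_fault_offset(void)
{
	printf("NULL->sgc would fault at address %zu\n", sgc_fault_offset());
}
```

If the faulting address in the oops matches that offset, it points at a NULL
group pointer rather than a NULL sgc.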

Now, it's bedtime for me. If you get the chance in the meantime, can you give
this a shot? I was about to send it out but realized it didn't really make
sense, but you never know...
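
The guard in the patch below boils down to the following, shown here as a
user-space sketch. mock_group and pick_misfit() are made-up names for
illustration, and max_capacity is flattened into the group instead of going
through ->sgc:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in: only the field the misfit comparison needs. */
struct mock_group {
	unsigned long max_capacity;
};

/*
 * Return true if 'group' should replace 'idlest' in the misfit case.
 * With no idlest picked yet, the first group wins unconditionally, so
 * later calls always have something to compare against.
 */
static bool pick_misfit(const struct mock_group *idlest,
			const struct mock_group *group)
{
	if (!idlest)
		return true;

	/* Otherwise only a strictly higher max capacity wins. */
	return idlest->max_capacity < group->max_capacity;
}
```

Without the !idlest check, the first comparison would dereference NULL; with
it, equal capacities still keep the current idlest.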

Also, if it is indeed misfit related, I'm surprised we (Arm folks) haven't
hit it sooner. We've had our scheduler tests running on the LB rework for at
least a month, so we should've hit it.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 08a233e97a01..e19ab7bff0f3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8348,7 +8348,14 @@ static bool update_pick_idlest(struct sched_group *idlest,
 		return false;
 
 	case group_misfit_task:
-		/* Select group with the highest max capacity */
+		/*
+		 * Select group with the highest max capacity. First group we
+		 * visit gets picked as idlest to allow later capacity
+		 * comparisons.
+		 */
+		if (!idlest)
+			return true;
+
 		if (idlest->sgc->max_capacity >= group->sgc->max_capacity)
 			return false;
 		break;