Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

From: Chen Yu
Date: Fri Mar 01 2024 - 01:47:09 EST


On 2024-02-29 at 20:10:32 +0800, Xuewen Yan wrote:
> Hi Yu
>
> On Wed, Feb 28, 2024 at 11:25 PM Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
> >
> > Hi Xuewen,
> >
> > On 2024-02-28 at 17:04:49 +0800, Xuewen Yan wrote:
> > > Hi Yu
> > >
> > > On Mon, Feb 26, 2024 at 4:26 PM Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
> > > >
> > > > There is occasional report from lkp that the kernel hits the NULL pointer
> > > > exception:
> > > >
> > > > [ 512.079810][ T8305] BUG: kernel NULL pointer dereference, address: 0000002c
> > > > [ 512.080897][ T8305] #PF: supervisor read access in kernel mode
> > > > [ 512.081636][ T8305] #PF: error_code(0x0000) - not-present page
> > > > [ 512.082337][ T8305] *pde = 00000000
> > > > [ 512.082829][ T8305] Oops: 0000 [#1] PREEMPT SMP
> > > > [ 512.083407][ T8305] CPU: 1 PID: 8305 Comm: watchdog Tainted: G W
> > > > [ 512.086203][ T8305] EIP: set_next_entity (fair.c:?)
> > > >
> > > > This is caused by NULL candidate returned by pick_eevdf() as Abel analyzed.
> > > > After
> > > > commit 2227a957e1d5 ("sched/eevdf: Sort the rbtree by virtual deadline")
> > > > the NULL candidate would trigger the NULL pointer exception. While before
> > > > this commit, there would be warning.
> > > >
> > > > This NULL entity issue was always there before above commit. With debug
> > > > patch to print the cfs_rq and all the entities in the tree, we have the
> > > > information when the issue was reproduced:
> > > >
> > > > [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 min_vruntime:763383370431
> > > > [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > > > min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> > > > [ 514.536772][ T8390] Traverse rb-tree from left to right
> > > > [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
> > > > [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > > [ 514.538539][ T8390] Traverse rb-tree from topdown
> > > > [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
> > > > [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > > [ 514.540340][ T8390] Found best:0x0
> > > > [ 514.540613][ T8390] BUG: kernel NULL pointer dereference, address: 00000074
> > > >
> > > > We can see that non of the entities in the tree are eligible, neither is
> > > > the current entity on this cfs_rq. As a result, curr is set to NULL:
> > > > if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> > > > curr = NULL;
> > > >
> > > > and the best is set to NULL, which caused the problem:
> > > > if (!best || (curr && entity_before(curr, best)))
> > > > best = curr;
> > > >
> > > > The cause is that, the curr is eligible, but vruntime_eligible()
> > > > returns false. And the false negative is due to the following
> > > > code in vruntime_eligible():
> > > >
> > > > return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
> > > >
> > > > According to the log, vruntime is 18435852013561943404, the
> > > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > > thus:
> > > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > > delta * 2677 = 7733399554989275921
> > > > that is to say, the multiply result overflow the s64, which turns the
> > > > negative value into a positive value, thus eligible check fails.
> > > >
> > > > So where is this insane huge vruntime 18435852013561943404 coming from?
> > > > My guess is that, it is because the initial value of cfs_rq->min_vruntime
> > > > is set to (unsigned long)(-(1LL << 20)). If the task(watchdog in this case)
> > > > seldom scheduled in, its vruntime might not move forward too much and
> > > > remain its original value by previous place_entity().
> > >
> > > Could you please test the follow patch:
> > >
> > > https://lore.kernel.org/all/20240130080643.1828-1-xuewen.yan@xxxxxxxxxx/
> > >
> > > Because the se->vruntime would be changed in place_entity() and
> > > reweight_eevdf(),
> > > however, now the vlag may exceed eevdf's limit in some scenarios, it may cause
> > > the vrunitme's error.
> > >
> >
> > Thanks for providing this fix. I took a quick look at your patch, and it looks
> > reasonable to me. I'm not quite sure if the vlag issue is the root cause of
> > this NULL candidate problem, say, if the abnormal vlag happens, it should satisfy:
> > curr->vlag == curr->deadline ?
>
> Actually, this is just a situation, because in set_next_entity(), it
> would set the valg be equal to deadline, and if using the vlag to
> update the vruntime, the vruntime is most likely wrong.

Yes, the child with wrong vlag might bring potential risk after child been place_entity().

> However, even if vlag is normal, after the calc: ((W + w_i)*vl'_i / W)
> or (w/w'), it may also be an out-of-bounds value.
> For example: if change the nice from -20(weight:88761) to
> 19(weight:15), its vlag would be 88761/15=5917*vlag.
>

Yes, this might exceed the limit range of vlag,

> >
> > According to the data,
> > curr->deadline =
> > cfs_rq->min_vruntime = 763383370431
> > cfs_rq->avg_vruntime = 386638640128
> > entity_key = -10892823530978643
> > avg = -6851585614346926319
> > load = 2677
> > so avg_vruntime() = 18444185410266494357
> > curr->vlag = 8333396704550953
> > which does not equal to curr->deadline = 18435852013562231446
> >
> > Anyway, if we want to root cause why vruntime is behind cfs_rq->min_vruntime so
> > much, we can add more debug code to further investigation. And we will also leverage
> > lkp to test your vlag fix patch set.
>
> Yes, maybe you can see more information by adding log in place_entity().
>

Sure, let me think about that.

thanks,
Chenyu