Re: [RFC PATCH] sched/fair: update the vruntime to be max vruntime when yield

From: Xuewen Yan
Date: Mon Aug 21 2023 - 03:51:46 EST


Hi Vincent

I have some questions and hope you can help.

On our platform, we found that the vruntime of some tasks becomes
abnormal over time, and tasks with abnormal vruntime are then never
scheduled.
The following are some of the tasks in the runqueue:
[status: curr] pid: 25501 prio: 116 vrun: 16426426403395799812
[status: skip] pid: 25496 prio: 116 vrun: 16426426403395800756
exec_start: 326203047009312 sum_ex: 29110005599
[status: pend] pid: 25497 prio: 116 vrun: 16426426403395800705
exec_start: 326203047002235 sum_ex: 29110508751
[status: pend] pid: 25321 prio: 130 vrun: 16668783152248554223
exec_start: 0 sum_ex: 16198728
[status: pend] pid: 25798 prio: 112 vrun: 17467381818375696015
exec_start: 0 sum_ex: 9574265
[status: pend] pid: 22282 prio: 120 vrun: 18010356387391134435
exec_start: 0 sum_ex: 53192
[status: pend] pid: 24259 prio: 120 vrun: 359915144918430571
exec_start: 0 sum_ex: 20508197
[status: pend] pid: 25988 prio: 120 vrun: 558552749871164416
exec_start: 0 sum_ex: 2099153
[status: pend] pid: 21857 prio: 124 vrun: 596088822758688878
exec_start: 0 sum_ex: 246057024
[status: pend] pid: 26614 prio: 130 vrun: 688210016831095807
exec_start: 0 sum_ex: 968307
[status: pend] pid: 14229 prio: 120 vrun: 816756964596474655
exec_start: 0 sum_ex: 793001
[status: pend] pid: 23866 prio: 120 vrun: 1313723379399791578
exec_start: 0 sum_ex: 1507038
...
[status: pend] pid: 25970 prio: 120 vrun: 6830180148220001175
exec_start: 0 sum_ex: 2531884
[status: pend] pid: 25965 prio: 120 vrun: 6830180150700833203
exec_start: 0 sum_ex: 8031809

Following your suggestion, we tested the patch:
https://lore.kernel.org/all/20230130122216.3555094-1-rkagan@xxxxxxxxx/T/#u
With this patch, the problem above is gone.

But when we tested with both patches applied:
https://lore.kernel.org/all/20230130122216.3555094-1-rkagan@xxxxxxxxx/T/#u
and
https://lore.kernel.org/all/20230317160810.107988-1-vincent.guittot@xxxxxxxxxx/
our issue unfortunately occurred again.

So we had to use a workaround for our problem, which is to change the
sleep-time threshold to 60s, i.e. from:
+
+ sleep_time -= se->exec_start;
+ if (sleep_time > ((1ULL << 63) / scale_load_down(NICE_0_LOAD)))
+ return true;

to

+ sleep_time -= se->exec_start;
+ if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
+ return true;

With this change, the issue did not occur again either.

But this modification doesn't actually solve the real problem. Qais
then suggested that we try this patch:
https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@xxxxxxx/T/#u

We tested that patch (Android phone, monkey test with 60 APKs, 7 days),
and the previous problem did not reproduce.

We would really appreciate it if you could take a look at the patch
and help us see what is going wrong.

Thanks!
BR

---
xuewen

On Fri, Jun 30, 2023 at 10:40 PM Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
>
> Hi Xuewen
>
> On 03/01/23 16:20, Xuewen Yan wrote:
> > On Wed, Mar 1, 2023 at 4:09 PM Vincent Guittot
> > <vincent.guittot@xxxxxxxxxx> wrote:
> > >
> > > On Wed, 1 Mar 2023 at 08:30, Xuewen Yan <xuewen.yan94@xxxxxxxxx> wrote:
> > > >
> > > > Hi Vincent
> > > >
> > > > I noticed the following patch:
> > > > https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@xxxxxxxxx/
> > > > And I notice the V2 had merged to mainline:
> > > > https://lore.kernel.org/all/20230130122216.3555094-1-rkagan@xxxxxxxxx/T/#u
> > > >
> > > > The patch fixed the inversing of the vruntime comparison, and I see
> > > > that in my case, there also are some vruntime is inverted.
> > > > Do you think which patch will work for our scenario? I would be very
> > > > grateful if you could give us some advice.
> > > > I would try this patch in our tree.
> > >
> > > By default use the one that is merged; The difference is mainly a
> > > matter of time range. Also be aware that the case of newly migrated
> > > task is not fully covered by both patches.
> >
> > Okay, Thank you very much!
> >
> > >
> > > This patch fixes a problem with long sleeping entity in the presence
> > > of low weight and always running entities. This doesn't seem to be
> > > aligned with the description of your use case
> >
> > Thanks for the clarification! We would try it first to see whether it
> > could resolve our problem.
>
> Did you get a chance to see if that patch help? It'd be good to backport it to
> LTS if it does.
>
>
> Thanks
>
> --
> Qais Yousef