Re: [PATCH] sched/fair: Introduce priority load balance for CFS

From: Vincent Guittot
Date: Mon Nov 14 2022 - 11:43:14 EST


On Sat, 12 Nov 2022 at 03:51, Song Zhang <zhangsong34@xxxxxxxxxx> wrote:
>
> Hi, Vincent
>
> On 2022/11/3 17:22, Vincent Guittot wrote:
> > On Thu, 3 Nov 2022 at 10:20, Song Zhang <zhangsong34@xxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 2022/11/3 16:33, Vincent Guittot wrote:
> >>> On Thu, 3 Nov 2022 at 04:01, Song Zhang <zhangsong34@xxxxxxxxxx> wrote:
> >>>>
> >>>> Thanks for your reply!
> >>>>
> >>>> On 2022/11/3 2:01, Vincent Guittot wrote:
> >>>>> On Wed, 2 Nov 2022 at 04:54, Song Zhang <zhangsong34@xxxxxxxxxx> wrote:
> >>>>>>
> >>>>>
> >>>>> This really looks like a v3 of
> >>>>> https://lore.kernel.org/all/20220810015636.3865248-1-zhangsong34@xxxxxxxxxx/
> >>>>>
> >>>>> Please keep versioning.
> >>>>>
> >>>>>> Add a new sysctl interface:
> >>>>>> /proc/sys/kernel/sched_prio_load_balance_enabled
> >>>>>
> >>>>> We don't want to add more sysctl knobs for the scheduler, we even
> >>>>> removed some. Knob usually means that you want to fix your use case
> >>>>> but the solution doesn't make sense for all cases.
> >>>>>
> >>>>
> >>>> OK, I will remove this knob later.
> >>>>
> >>>>>>
> >>>>>> 0: default behavior
> >>>>>> 1: enable priority load balance for CFS
> >>>>>>
> >>>>>> For co-location of idle and non-idle tasks, when CFS does load balance,
> >>>>>> it is reasonable to prefer migrating non-idle tasks and to migrate idle
> >>>>>> tasks last. This will reduce the interference from SCHED_IDLE tasks
> >>>>>> as much as possible.
> >>>>>
> >>>>> I don't agree that it's always the best choice to migrate a non-idle task 1st.
> >>>>>
> >>>>> CPU0 has 1 non-idle task, CPU1 has 1 non-idle task and hundreds of
> >>>>> idle tasks, and there is an imbalance between the 2 CPUs: migrating the
> >>>>> non-idle task from CPU1 to CPU0 is not the best choice.
> >>>>>
> >>>>
> >>>> If the non idle task on CPU1 is running or cache hot, it cannot be
> >>>> migrated and idle tasks can also be migrated from CPU1 to CPU0. So I
> >>>> think it does not matter.
> >>>
> >>> What I mean is that migrating non idle tasks first is not a universal
> >>> win and not always what we want.
> >>>
> >>
> >> But migrating online tasks first is mostly a trade-off that
> >> non-idle(Latency Sensitive) tasks can obtain more CPU time and minimize
> >> the interference caused by IDLE tasks. I think this makes sense in most
> >> cases, or you can point out what else I need to think about it ?
> >>
> >> Best regards.
> >>
> >>>>
> >>>>>>
> >>>>>> Testcase:
> >>>>>> - Spawn a large number of idle (SCHED_IDLE) tasks to occupy the CPUs
> >>>>>
> >>>>> What do you mean by a large number ?
> >>>>>
> >>>>>> - Let non-idle tasks compete with idle tasks for CPU time.
> >>>>>>
> >>>>>> Using schbench to test non-idle tasks latency:
> >>>>>> $ ./schbench -m 1 -t 10 -r 30 -R 200
> >>>>>
> >>>>> How many CPUs do you have ?
> >>>>>
> >>>>
> >>>> OK, some details may not be mentioned.
> >>>> My virtual machine has 8 CPUs running with a schbench process and 5000
> >>>> idle tasks. The idle task is a while dead loop process below:
> >>>
> >>> How can you care about latency when you start 10 workers on 8 vCPUs
> >>> with 5000 non idle threads ?
> >>>
> >>
> >> No no no... spawn 5000 idle(SCHED_IDLE) processes not 5000 non-idle
> >> threads, and with 10 non-idle schbench workers on 8 vCPUs.
> >
> > yes spawn 5000 idle tasks but my point remains the same
> >
>
> I am so sorry that I have not received your reply for a long time, and I
> am still waiting for it anxiously. In fact, migrating non-idle tasks 1st
> works well in most scenarios, so it may be possible to add a
> sched_feat(LB_PRIO) to enable or disable that. Finally, I really hope
> you can give me some better advice.

I have seen that you posted a v4 5 days ago which is on my list to be reviewed.

My concern here remains that selecting a non-idle task first is not
always the best choice, for example when you have 1 non-idle task per
CPU and thousands of idle tasks moving around. Then, regarding your use
case, the weight of the 5000 idle threads is more than twice the weight
of your non-idle bench: the summed weight of the idle threads is 15k,
whereas the weight of your bench is around 6k, IIUC how RPS runs. This
also means that the idle threads will take a significant share of
system time: 5000 / 7000 ticks. I don't understand how you can care
about latency in such an extreme case, and I'm interested in the real
use case where you can have such a situation.
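
For reference, the arithmetic behind those numbers can be checked with a
short sketch. It assumes each SCHED_IDLE task carries the kernel's
WEIGHT_IDLEPRIO of 3 and takes the ~6k bench weight as the estimate given
above; the helper name cpu_share is only illustrative and ignores how the
load is spread across the 8 vCPUs.

```python
# Rough check of the weight figures quoted above; not scheduler code.
# Assumption: every SCHED_IDLE task has weight WEIGHT_IDLEPRIO (3).
WEIGHT_IDLEPRIO = 3


def cpu_share(own_weight, other_weight):
    """Fraction of CPU time a group gets under weight-proportional sharing."""
    return own_weight / (own_weight + other_weight)


idle_threads = 5000
idle_weight = idle_threads * WEIGHT_IDLEPRIO  # summed weight of idle threads
bench_weight = 6000  # estimate from the mail, not a measured value

idle_share = cpu_share(idle_weight, bench_weight)

print(f"idle weight: {idle_weight}")      # 15000, the "15k" above
print(f"idle share:  {idle_share:.3f}")   # 0.714, i.e. 5000 / 7000 ticks
```

Under these assumptions the idle threads collectively get about 5/7 of
the CPU time whenever the bench and the idle threads are all runnable.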

All that to say that idle tasks remain CFS tasks with a small but
non-zero weight, and we should not make them special other than by not
preempting at wakeup.

>
> Best regards.
>
> Song Zhang