Re: [Patch v2 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups

From: Tim Chen
Date: Thu Jun 15 2023 - 13:02:17 EST


On Thu, 2023-06-15 at 13:07 +0200, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 10:46:36AM -0700, Tim Chen wrote:
> > On Mon, 2023-06-12 at 14:05 +0200, Peter Zijlstra wrote:
>
> > > > +	/* Limit tasks moved from preferred group, don't leave cores idle */
> > > > +	limit = busiest->sum_nr_running;
> > > > +	lsub_positive(&limit, ncores_busiest);
> > > > +	if (imbalance > limit)
> > > > +		imbalance = limit;
> > >
> > > How does this affect the server parts that have larger than single core
> > > turbo domains?
> >
> > Are you thinking about the case where the local group is completely empty
> > so there's turbo headroom and we should move at least one task, even though
> > CPU in busiest group has higher priority?
>
> Something along those lines, I didn't think it through, just wondered
> about the wisdom of piling everything in the highest priority 'domain',
> which would depress the turbo range.
>
> Rjw said that on modern client the frequency domains are per-core, but
> that on server parts they are still wider -- or something along those
> lines. So it might make sense to consider some of that.
>

I see what you are saying. In that case, even for ASYM_PACKING, it probably
makes more sense to simply distribute tasks in proportion to the number of
cores in each sched group, while making sure we do not let the preferred
sched group go idle.
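A minimal sketch of that idea, with a hypothetical helper (this is not the patch's actual code): split the tasks between two groups in proportion to their core counts, while making sure the preferred group is not left idle when there is work.

```c
#include <assert.h>

/*
 * Hypothetical sketch, not the actual kernel code: give the preferred
 * group a share of total_tasks proportional to its core count, but
 * never leave it idle while tasks exist.
 */
static int preferred_share(int total_tasks, int ncores_pref, int ncores_other)
{
	int share = total_tasks * ncores_pref / (ncores_pref + ncores_other);

	/* Do not let the preferred group sit idle if there is work. */
	if (share == 0 && total_tasks > 0)
		share = 1;
	return share;
}
```

For example, with 6 tasks split across a 2-core preferred group and a 4-core group, the preferred group would get 2 tasks rather than having everything piled onto it.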

>
> > In other words, are you suggesting we should add
> >
> > 	if (imbalance == 0 && busiest->sum_nr_running > 0 &&
> > 	    local->sum_nr_running == 0)
> > 		imbalance = 1;
>
> I didn't get that far; and I don't think we have the right topology
> information on the servers to even begin considering the effects of the
> turbo-range, so perhaps it all doesn't matter.
>
> Just wanted to raise the point for consideration.
>
> Because as-is, the policy of piling extra on the preferred group doesn't
> immediately make sense. IIRC the whole ITMT preferred core is mostly
> about an extra turbo bin, but if you pile on, the headroom evaporates --
> you'll never turbo that high anymore and special casing it doesn't make
> sense.
>
> So perhaps I'm not saying more code, but less code is better here.
>
> Dunno, is any of this measurable either way around?

As far as I know, only client CPUs have ITMT and asymmetric turbo.
So this behavior could only be approximated on a client part's Atom clusters
with the big cores turned off.

>
> > > > +
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	/* Take advantage of resource in an empty sched group */
> > > > +	if (imbalance == 0 && local->sum_nr_running == 0 &&
> > > > +	    busiest->sum_nr_running > 1)
> > > > +		imbalance = 1;
> > > > +out:
> > > > +	return imbalance << 1;
> > > > +}
> > >
> > >
> > > But basically you have:
> > >
> > >             LcBn - BcLn
> > >       imb = -----------
> > >                 LcBc
> > >
> > > Which makes sense, except you then return:
> > >
> > > imb * 2
> > >
> > > which then made me wonder about rounding.
> > >
> > > Do we want to add (LcBc - 1) or (LcBc/2) to resp. ceil() or round()
> > > the thing before division? Because currently it uses floor().
> > >
> > > If you evaluate it like:
> > >
> > >
> > >             2 * (LcBn - BcLn)
> > >       imb = -----------------
> > >                    LcBc
> > >
> > > The result is different from what you have now.
> >
> > If I do the rounding after multiplying imb by two (call it imb_new),
> > the difference with imb I am returning now (call it imb_old)
> > will be at most 1. Note that imb_old returned is always a multiple of 2.
> >
> > I will be using imb in calculate_imbalance() and divide it
> > by 2 there to get the number of tasks to move from the busiest group.
> > So when there is a difference of 1 between imb_old and imb_new,
> > the difference will be trimmed off after the division of 2.
> >
> > We will get the same number of tasks to move with either
> > imb_old or imb_new in calculate_imbalance() so the two
> > computations will arrive at the same result eventually.
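Tim's argument above can be sketched with two hypothetical helpers named after the thread's shorthand (num = LcBn - BcLn, den = LcBc; these are illustrative, not the patch's code). imb_old doubles after a floor()ed division; imb_new floor()s after doubling the numerator. They differ by at most 1, and since the caller divides by 2 again, that difference is trimmed off.

```c
#include <assert.h>

/* imb_old: floor() the division first, then multiply by 2. */
static int imb_old(int num, int den)
{
	return (num / den) << 1;
}

/* imb_new: multiply the numerator by 2 first, then floor() the division. */
static int imb_new(int num, int den)
{
	return (2 * num) / den;
}
```

For example, num = 3, den = 2 gives imb_old = 2 and imb_new = 3, but both yield 1 task once calculate_imbalance() divides by 2.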
> >
> > >
> > > What actual behaviour is desired in these low imbalance cases? and can
> > > you write a comment as to why we do as we do etc..?
> >
> > I do not keep imb as
> >
> >             2 * (LcBn - BcLn)
> >       imb = -----------------
> >                    LcBc
> >
> > as it is easier to leave out the factor of 2
> > in the middle of the sibling_imbalance() computation,
> > so I can directly interpret imb as the number
> > of tasks to move and add the factor of two
> > only when I actually return the imbalance.
> >
> > Would you like me to add this reasoning in the comments?
>
> So if we want a multiple of 2, leaving that multiplication off makes
> sense, but I'm not sure I got the argument for or against the rounding.
>
> floor() gets us 1 task to move when there is at least a whole task's
> worth of imbalance, but less than 2.
>
> round() would get us 1 task to move when there's at least half a task's
> worth of imbalance but less than 1.5.
>
> ceil() will migrate on any imbalance, however small -- which will result
> in ping-pong I suppose, so let's disregard that.
>
> The difference, with the multiplication later, is 0 or 2.
>
> Does the round() still result in ping-pong?

Will have to experiment to see whether round() is better.
I chose floor() on purpose in my initial implementation
to minimize migrations.
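For reference, the three policies being compared can be sketched as integer-division helpers for an imbalance of num/den tasks (num >= 0, den > 0; illustrative helpers, not kernel code).

```c
#include <assert.h>

/* floor(): migrate only on at least a whole task's worth of imbalance. */
static int imb_floor(int num, int den) { return num / den; }

/* round(): migrate on at least half a task's worth of imbalance. */
static int imb_round(int num, int den) { return (num + den / 2) / den; }

/* ceil(): migrate on any nonzero imbalance, however small. */
static int imb_ceil(int num, int den)  { return (num + den - 1) / den; }
```

At half a task's worth of imbalance (num/den = 1/2), floor() moves nothing while round() moves one task; ceil() moves a task for any nonzero imbalance, which is the ping-pong risk Peter mentions.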

Tim