Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups

From: Shrikanth Hegde
Date: Mon Jul 17 2023 - 08:39:07 EST




On 7/17/23 4:40 PM, Peter Zijlstra wrote:
> On Mon, Jul 17, 2023 at 01:06:59AM +0530, Shrikanth Hegde wrote:
>>
>>
>> On 7/15/23 4:35 AM, Tim Chen wrote:
>>> On Fri, 2023-07-14 at 18:36 +0530, Shrikanth Hegde wrote:
>>>
>>>>
>>>>
>>>> If we consider symmetric platforms which have SMT4 such as power10.
>>>> we have a topology like below. multiple such MC will form DIE(PKG)
>>>>
>>>>
>>>> [0 2 4 6][1 3 5 7][8 10 12 14][9 11 13 15]
>>>> [--SMT--][--SMT--][----SMT---][---SMT----]
>>>> [--sg1--][--sg1--][---sg1----][---sg1----]
>>>> [--------------MC------------------------]
>>>>
>>>> In case of SMT4, if there is any group which has 2 or more tasks, that
>>>> group will be marked as group_smt_balance. previously, if that group had 2
>>>> or 3 tasks, it would have been marked as group_has_spare. Since all the groups have
>>>> SMT that means behavior would be same fully busy right? That can cause some
>>>> corner cases. No?
>>>
>>> You raised a good point. I was looking from SMT2
>>> perspective so group_smt_balance implies group_fully_busy.
>>> That is no longer true for SMT4.
>>>
>>> I am thinking of the following fix on the current patch
>>> to take care of SMT4. Do you think this addresses
>>
>> Thanks Tim for taking a look at it again.
>>
>> Yes. I think this would address some of the corner cases.
>> Any SMT4 group having 2,3,4 will have smt_balance as the group type, and busiest one
>> is the one which has least number of idle cpu's. (same conditions as group_has_spare)
>>
>>
>>
>>
>>> concerns from you and Tobias?
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 294a662c9410..3fc8d3a3bd22 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -9588,6 +9588,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>> break;
>>>
>>> case group_smt_balance:
>>> + /* no idle cpus on both groups handled by group_fully_busy below */
>>> + if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
>>> + if (sgs->idle_cpus > busiest->idle_cpus)
>>> + return false;
>>> + if (sgs->idle_cpus < busiest->idle_cpus)
>>> + return true;
>>> + if (sgs->sum_nr_running <= busiest_sum_nr_running)
>>> + return false;
>>> + else
>>> + return true;
>>> + }
>>>
>>>
>>> I will be on vacation next three weeks so my response will be slow.
>>>
>>> Tim
>>>
>>>>
>>
>> Small suggestion to above code to avoid compiler warning of switch case falling
>> through and else case can be removed, since update_sd_pick_busiest by default returns true.
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e5a75c76bcaa..ae364ac6f22e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9728,9 +9728,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>> return true;
>> if (sgs->sum_nr_running <= busiest->sum_nr_running)
>> return false;
>> - else
>> - return true;
>> }
>> + break;
>> +
>> case group_fully_busy:
>> /*
>> * Select the fully busy group with highest avg_load. In
>>
>>
>
> Can someone please send a full patch for this? I've already queued Tim's
> patches in tip/sched/core (tip-bot seems to have died somewhere last
> week, it's being worked on).

Hi Peter.

Sending on behalf of tim. I have included my suggestion as well. Hope that's ok.
Please find below the patch as of now. it includes the couple of changes that are discussed. (in 1/6 and in 3/6)


---
kernel/sched/fair.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 932e7b78894a..9502013abe33 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9532,7 +9532,7 @@ static inline long sibling_imbalance(struct lb_env *env,
imbalance /= ncores_local + ncores_busiest;

/* Take advantage of resource in an empty sched group */
- if (imbalance == 0 && local->sum_nr_running == 0 &&
+ if (imbalance <= 1 && local->sum_nr_running == 0 &&
busiest->sum_nr_running > 1)
imbalance = 2;

@@ -9720,6 +9720,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
break;

case group_smt_balance:
+ /* no idle cpus on both groups handled by group_fully_busy below */
+ if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
+ if (sgs->idle_cpus > busiest->idle_cpus)
+ return false;
+ if (sgs->idle_cpus < busiest->idle_cpus)
+ return true;
+ if (sgs->sum_nr_running <= busiest->sum_nr_running)
+ return false;
+ }
+ break;
+
case group_fully_busy:
/*
* Select the fully busy group with highest avg_load. In
--
2.31.1