Re: [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle_balance

From: K Prateek Nayak
Date: Thu Aug 31 2023 - 15:47:43 EST

Next message: Uladzislau Rezki: "Re: [PATCH v2 1/2] mm/vmalloc: Add a safer version of find_vm_area() for debug"
Previous message: Biju Das: "[PATCH v3 3/3] ASoC: tlv320aic32x4-spi: Simplify probe()"
In reply to: David Vernet: "Re: [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle_balance"
Next in thread: K Prateek Nayak: "[RFC PATCH 3/3] sched/fair: Add a per-shard overload flag"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hello David,

Thank you for taking a look at this despite being on vacation.

On 9/1/2023 12:15 AM, David Vernet wrote:
> On Thu, Aug 31, 2023 at 04:15:07PM +0530, K Prateek Nayak wrote:
>> This patch takes the relevant optimizations from [1] in
>> newidle_balance(). Following is the breakdown:
>
> Thanks for working on this. I think the fix you added for skipping <=
> LLC domains makes sense. The others possibly as well

I too am in doubt with some of them but I left them in since I was
building on top of the cumulative diff.

> -- left some
> comments below!
>
>>
>> - Check "rq->rd->overload" before jumping into newidle_balance, even
>> with SHARED_RQ feat enabled.
>
> Out of curiosity -- did you observe this making a material difference in
> your tests? After thinking about it some more, though I see the argument
> for why it would be logical to check if we're overloaded, I'm still
> thinking that it's more ideal to just always check the SHARED_RUNQ.
> rd->overload is only set in find_busiest_group() when we load balance,
> so I worry that having SHARED_RUNQ follow rd->overload may just end up
> making it redundant with normal load balancing in many cases.
>
> So yeah, while I certainly understand the idea (and would like to better
> understand what kind of difference it made in your tests), I still feel
> pretty strongly that SHARED_RUNQ makes the most sense as a feature when
> it ignores all of these heuristics and just tries to maximize work
> conservation.
>
> What do you think?

Actually, as it turns out, it was probably a combination of the
rq->avg_idle check + updating of cost that got the performance back
during experimenting. In Patch 3, I've give the results with this patch
alone and it makes no difference, for tbench 128-client at least. There
is the same rq lock contention I mentioned previously which is why the
per-shard "overload" flag.

Based on Anna-Maria's observation in [1], we have a short idling, spread
across the system with tbench. Now it is possible we are doing a
newidle_balance() when it would have been better off to let the CPU idle
for that short duration without and not cause a contention for the rq
lock.

[1] https://lore.kernel.org/lkml/80956e8f-761e-b74-1c7a-3966f9e8d934@xxxxxxxxxxxxx/

>
>> - Call update_next_balance() for all the domains till MC Domain in
>> when SHARED_RQ path is taken.
>
> I _think_ this makes sense. Though even in this case, I feel that it may
> be slightly confusing and/or incorrect to push back the balance time
> just because we didn't find a task in our current CCX's shared_runq.
> Maybe we should avoid mucking with load balancing? Not sure, but I am
> leaning towards what you're proposing here as a better approach.

This requires a deeper look and more testing yes.

>
>> - Account cost from shared_runq_pick_next_task() and update
>> curr_cost and sd->max_newidle_lb_cost accordingly.
>
> Yep, I think this is the correct thing to do.
>
>>
>> - Move the initial rq_unpin_lock() logic around. Also, the caller of
>> shared_runq_pick_next_task() is responsible for calling
>> rq_repin_lock() if the return value is non zero. (Needs to be verified
>> everything is right with LOCKDEP)
>
> Still need to think more about this, but it's purely just tactical and
> can easily be fixed it we need.

I agree. I'll leave the full picture of this below in
[Locking code movement clarifications] since we seem to keep coming back
to this and it would be good to have more eyes on what is going on in my
mind :)

>
>>
>> - Includes a fix to skip directly above the LLC domain when calling the
>> load_balance() in newidle_balance()
>
> Big fix, thanks again for noticing it.
>
>> All other surgery from [1] has been removed.
>>
>> Link: https://lore.kernel.org/all/31aeb639-1d66-2d12-1673-c19fed0ab33a@xxxxxxx/ [1]
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
>> ---
>> kernel/sched/fair.c | 94 ++++++++++++++++++++++++++++++++-------------
>> 1 file changed, 67 insertions(+), 27 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bf844ffa79c2..446ffdad49e1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -337,7 +337,6 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>> rq_unpin_lock(rq, &src_rf);
>> raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>> }
>> - rq_repin_lock(rq, rf);
>>
>> return ret;
>> }
>> @@ -12276,50 +12275,83 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>> if (!cpu_active(this_cpu))
>> return 0;
>>
>> - if (sched_feat(SHARED_RUNQ)) {
>> - pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> - if (pulled_task)
>> - return pulled_task;
>> - }
>> -
>> /*
>> * We must set idle_stamp _before_ calling idle_balance(), such that we
>> * measure the duration of idle_balance() as idle time.
>> */
>> this_rq->idle_stamp = rq_clock(this_rq);
>>
>> - /*
>> - * This is OK, because current is on_cpu, which avoids it being picked
>> - * for load-balance and preemption/IRQs are still disabled avoiding
>> - * further scheduler activity on it and we're being very careful to
>> - * re-start the picking loop.
>> - */
>> - rq_unpin_lock(this_rq, rf);
>> -
>> rcu_read_lock();
>> - sd = rcu_dereference_check_sched_domain(this_rq->sd);
>> -
>> - /*
>> - * Skip <= LLC domains as they likely won't have any tasks if the
>> - * shared runq is empty.
>> - */
>> - if (sched_feat(SHARED_RUNQ)) {
>> + if (sched_feat(SHARED_RUNQ))
>> sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> - if (likely(sd))
>> - sd = sd->parent;
>> - }
>> + else
>> + sd = rcu_dereference_check_sched_domain(this_rq->sd);
>>
>> if (!READ_ONCE(this_rq->rd->overload) ||
>> - (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>> + /* Look at rq->avg_idle iff SHARED_RUNQ is disabled */
>> + (!sched_feat(SHARED_RUNQ) && sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>>
>> - if (sd)
>> + while (sd) {
>> update_next_balance(sd, &next_balance);
>> + sd = sd->child;
>> + }
>> +
>> rcu_read_unlock();
>>
>> goto out;
>> }
>> +
>> + if (sched_feat(SHARED_RUNQ)) {
>> + struct sched_domain *tmp = sd;
>> +
>> + t0 = sched_clock_cpu(this_cpu);
>> +
>> + /* Do update_next_balance() for all domains within LLC */
>> + while (tmp) {
>> + update_next_balance(tmp, &next_balance);
>> + tmp = tmp->child;
>> + }
>> +
>> + pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> + if (pulled_task) {
>> + if (sd) {
>> + curr_cost = sched_clock_cpu(this_cpu) - t0;
>> + /*
>> + * Will help bail out of scans of higer domains
>> + * slightly earlier.
>> + */
>> + update_newidle_cost(sd, curr_cost);
>> + }
>> +
>> + rcu_read_unlock();
>> + goto out_swq;
>> + }
>> +
>> + if (sd) {
>> + t1 = sched_clock_cpu(this_cpu);
>> + curr_cost += t1 - t0;
>> + update_newidle_cost(sd, curr_cost);
>> + }
>> +
>> + /*
>> + * Since shared_runq_pick_next_task() can take a while
>> + * check if the CPU was targetted for a wakeup in the
>> + * meantime.
>> + */
>> + if (this_rq->ttwu_pending) {
>> + rcu_read_unlock();
>> + return 0;
>> + }
>
> At first I was wondering whether we should do this above
> update_newidle_cost(), but I think it makes sense to always call
> update_newidle_cost() after we've failed to get a task from
> shared_runq_pick_next_task().

Indeed. I think the cost might be useful to be accounted for.

>
>> + }
>> rcu_read_unlock();
>>
>> + /*
>> + * This is OK, because current is on_cpu, which avoids it being picked
>> + * for load-balance and preemption/IRQs are still disabled avoiding
>> + * further scheduler activity on it and we're being very careful to
>> + * re-start the picking loop.
>> + */
>> + rq_unpin_lock(this_rq, rf);
>
> Don't you need to do this before you exit on the rq->ttwu_pending path?

[Locking code movement clarifications]

Okay this is where I'll put all the locking bits I have in my head:

o First, the removal of rq_repin_lock() in shared_runq_pick_next_task()

Since this is only called from newidle_balance(), it is easy to
isolate the changes. shared_runq_pick_next_task() can return either
0, 1 or -1. The interpretation is same as return value of
newidle_balance():

0: Unsuccessful at pulling task but the rq lock was never released
and reacquired - it was held all the time.

1: Task was pulled successfully. The rq lock was released and
reacquired in the process but now, with the above changes, it is
not pinned.

-1: Unsuccessful at pulling task but the rq lock was released and
reacquired in the process and now, with the above changes, it is
not pinned.

Now the following block:

pulled_task = shared_runq_pick_next_task(this_rq, rf);
if (pulled_task) {
...
goto out_swq;
}

takes care of the case where return values are -1, or 1. The "out_swq"
label is almost towards the end of newidle_balance() and just before
returning, the newidle_balance() does:

rq_repin_lock(this_rq, rf);

So this path will repin the lock.

Now for the case where shared_runq_pick_next_task() return 0.

o Which brings us to the question you asked above

newidle_balance() is called with the rq lock held and pinned, and it
expects the same when newidle_balance() reruns. The very first bailout
check in newidle_balance() is:

if (this_rq->ttwu_pending)
return 0;

so we return without doing any changed to the state of rq lock.

Coming to the above changes, if we have to hit the ttwu_pending
bailout you pointed at, shared_runq_pick_next_task() should return 0,
signifying no modification to state of the lock or pinning. Then we
update the cost, and come to ttwu_pending check. We still have the
lock held, and it is pinned. Thus we do not need to unpin the lock
since we newidle_balance() is expected to return with lock held and
it being pinned.

Please let me know if I've missed something.

>
>> raw_spin_rq_unlock(this_rq);
>>
>> t0 = sched_clock_cpu(this_cpu);
>> @@ -12335,6 +12367,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>> break;
>>
>> + /*
>> + * Skip <= LLC domains as they likely won't have any tasks if the
>> + * shared runq is empty.
>> + */
>> + if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
>> + continue;
>> +
>> if (sd->flags & SD_BALANCE_NEWIDLE) {
>>
>> pulled_task = load_balance(this_cpu, this_rq,
>> @@ -12361,6 +12400,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>
>> raw_spin_rq_lock(this_rq);
>>
>> +out_swq:
>> if (curr_cost > this_rq->max_idle_balance_cost)
>> this_rq->max_idle_balance_cost = curr_cost;
>>
>
>
> Thanks,
> David

--
Thanks and Regards,
Prateek

Next message: Uladzislau Rezki: "Re: [PATCH v2 1/2] mm/vmalloc: Add a safer version of find_vm_area() for debug"
Previous message: Biju Das: "[PATCH v3 3/3] ASoC: tlv320aic32x4-spi: Simplify probe()"
In reply to: David Vernet: "Re: [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle_balance"
Next in thread: K Prateek Nayak: "[RFC PATCH 3/3] sched/fair: Add a per-shard overload flag"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]