Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

From: David Vernet
Date: Fri Sep 29 2023 - 13:01:22 EST


On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
> Hello David,
>
> On 9/1/2023 12:41 AM, David Vernet wrote:
> > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> >
> > Hi Prateek,
> >
> >> Even with the two patches, I still observe the following lock
> >> contention when profiling the tbench 128-clients run with IBS:
> >>
> >> -   12.61%  swapper  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
> >>    - 10.94% native_queued_spin_lock_slowpath
> >>       - 10.73% _raw_spin_lock
> >>          - 9.57% __schedule
> >>               schedule_idle
> >>               do_idle
> >>             + cpu_startup_entry
> >>          - 0.82% task_rq_lock
> >>               newidle_balance
> >>               pick_next_task_fair
> >>               __schedule
> >>               schedule_idle
> >>               do_idle
> >>             + cpu_startup_entry
> >>
> >> Since David mentioned the rq->avg_idle check is probably not the right
> >> step towards the solution, this experiment introduces a per-shard
> >> "overload" flag. Similar to "rq->rd->overload", the per-shard overload
> >> flag signals the possibility of one or more rqs covered by the shard's
> >> domain having a queued task. The shard's overload flag is set at the
> >> same time as "rq->rd->overload", and is cleared when the shard's list
> >> is found to be empty.
> >
> > I think this is an interesting idea, but I feel that it's still working
> > against the core proposition of SHARED_RUNQ, which is to enable work
> > conservation.
>
> I don't think so! Work conservation is possible if there is an
> imbalance. Consider the case where we have 15 tasks in the shared_runq
> but we have 16 CPUs, 15 of which are running these 15 tasks, and one going

I'm not sure I'm fully following. Those 15 tasks would not be enqueued
in the shared runq if they were being run. They would be dequeued from
the shared_runq in __dequeue_entity(), which would be called from
set_next_entity() before they were run. In this case, the
shard->overload check should be equivalent to the
!list_empty(&shard->list) check.

Oh, or is the idea that we're not bothering to pull them from the
shared_runq if they're being woken up and enqueued on an idle core that
will immediately run them on the next resched path? If so, I wonder if
we would instead just want to not enqueue the task in the shared_runq at
all? Consider that if another task comes in on an rq with
rq->nr_running >= 2, that we still wouldn't want to pull the tasks that
were being woken up on idle cores (nor take the overhead of inserting
and then immediately removing them from the shared_runq).

> idle. Work is conserved. What we need to worry about is tasks being in
> the shared_runq that are queued on their respective CPU. This can only
> happen if any one of the rqs has nr_running >= 2, which is also the point
> we are setting "shard->overload".

Assuming this is about the "wakeup / enqueue to idle core" case, ok,
this makes sense. I still think it probably makes more sense to just not
enqueue in the shared_runq for this case though, which would allow us to
instead just rely on list_empty(&shard->list).

> Now the situation can change later and all tasks in the shared_runq might be
> running on respective CPUs but "shard->overload" is only cleared when
> the shared_runq becomes empty. If this is too late, maybe we can clear
> it if periodic load balancing finds no queuing (somewhere around the
> time we update nr_idle_scan).
>
> So the window where we do not go ahead with popping a task from the
> shared_runq_shard->list is between the list being empty and at least one
> of the CPUs associated with the shard reporting nr_running >= 2, which is
> when work conservation is needed.

So, I misread your patch the first time I reviewed it, and for some
reason thought you were only setting shard->overload on the
load_balance(). That's obviously not the case, and I now understand it
better, modulo my points above being clarified.

> >
> >> With these changes, following are the results for tbench 128-clients:
> >
> > Just to make sure I understand, this is to address the contention we're
> > observing on tbench with 64 - 256 clients, right? That's my
> > understanding from Gautham's reply in [0].
> >
> > [0]: https://lore.kernel.org/all/ZOc7i7wM0x4hF4vL@xxxxxxxxxxxxxxxxxxxxxx/
>
> I'm not sure if Gautham saw contention with IBS but he did see an
> abnormal blowup in newidle_balance() counts which he suspected were the
> cause for the regression. I noticed the rq lock contention after doing a
> fair bit of surgery. Let me go check if that was the case with vanilla
> v3 too.
>
> >
> > If so, are we sure this change won't regress other workloads that would
> > have benefited from the work conservation?
>
> I don't think we'll regress any workloads as I explained above, because
> the "overload" flag being 0 almost always (the update/access is not
> atomic) indicates a case where the tasks cannot be pulled. However, that
> needs to be tested since there is a small behavior change in
> shared_runq_pick_next_task(). Previously, if the task was running on a
> CPU, we would have popped it from the shared_runq and done some lock
> fiddling before finding out it was running, then some more lock fiddling
> before the function returned "-1". With the changes here, it'll simply
> return a "0", and although that is correct, we have seen some
> interesting cases in the past [1] where a random lock contention
> actually helps certain benchmarks ¯\_(ツ)_/¯

I don't think we need to worry about less lock contention possibly
hurting benchmarks :-)

> [1] https://lore.kernel.org/all/44428f1e-ca2c-466f-952f-d5ad33f12073@xxxxxxx/
>
> >
> > Also, I assume that you don't see the improved contention without this,
> > even if you include your fix to the newidle_balance() that has us skip
> > over the <= LLC domain?
>
> No improvements! The lock is still very much contended for. I wonder if
> it could be because of the unlocking and relocking of the rq in
> shared_runq_pick_next_task() even when the task is on a CPU. Also, since
> it returns -1 for this case, pick_next_task_fair() will return
> RETRY_TASK, which can have further implications.

Yeah, I could see it being an issue if we're essentially thrashing tasks
that are just temporarily enqueued in the shared_runq between activate
and a resched pass on an idle core.

Unfortunately, I don't think we have any choice but to drop and
reacquire the rq lock. It's not safe to look at task_cpu(p) without
pi_lock, and we can't safely acquire that without dropping the rq lock.

Thanks,
David