Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

From: K Prateek Nayak
Date: Wed Oct 04 2023 - 00:21:42 EST


Hello David,

Thank you for answering my queries; I'll leave some data below to
answer yours.

On 9/29/2023 10:31 PM, David Vernet wrote:
> On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
>> Hello David,
>>
>> On 9/1/2023 12:41 AM, David Vernet wrote:
>>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
>>>
>>> Hi Prateek,
>>>
>>>> Even with the two patches, I still observe the following lock
>>>> contention when profiling the tbench 128-clients run with IBS:
>>>>
>>>>  -   12.61%  swapper  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>>>>     - 10.94% native_queued_spin_lock_slowpath
>>>>        - 10.73% _raw_spin_lock
>>>>           - 9.57% __schedule
>>>>                schedule_idle
>>>>                do_idle
>>>>              + cpu_startup_entry
>>>>           - 0.82% task_rq_lock
>>>>                newidle_balance
>>>>                pick_next_task_fair
>>>>                __schedule
>>>>                schedule_idle
>>>>                do_idle
>>>>              + cpu_startup_entry
>>>>
>>>> Since David mentioned the rq->avg_idle check is probably not the right
>>>> step towards the solution, this experiment introduces a per-shard
>>>> "overload" flag. Similar to "rq->rd->overload", the per-shard overload
>>>> flag notifies of the possibility of one or more rqs covered by the
>>>> shard's domain having a queued task. The shard's overload flag is set at
>>>> the same time as "rq->rd->overload", and is cleared when the shard's
>>>> list is found to be empty.
>>>
>>> I think this is an interesting idea, but I feel that it's still working
>>> against the core proposition of SHARED_RUNQ, which is to enable work
>>> conservation.
>>
>> I don't think so! Work conservation is possible if there is an
>> imbalance. Consider the case where we have 15 tasks in the shared_runq but we
>> have 16 CPUs, 15 of which are running these 15 tasks, and one going
>
> I'm not sure I'm fully following. Those 15 tasks would not be enqueued
> in the shared runq if they were being run. They would be dequeued from
> the shared_runq in __dequeue_entity(), which would be called from
> set_next_entity() before they were run. In this case, the
> shard->overload check should be equivalent to the
> !list_empty(&shard->list) check.
>
> Oh, or is the idea that we're not bothering to pull them from the
> shared_runq if they're being woken up and enqueued on an idle core that
> will immediately run them on the next resched path? If so, I wonder if
> we would instead just want to not enqueue the task in the shared_runq at
> all? Consider that if another task comes in on an rq with
> rq->nr_running >= 2, that we still wouldn't want to pull the tasks that
> were being woken up on idle cores (nor take the overhead of inserting
> and then immediately removing them from the shared_runq).

So this is the breakdown of outcomes after peeking into the shared_runq
during newidle_balance:

                                        SHARED_RUNQ                SHARED_RUNQ
                                 + correct cost accounting   + correct cost accounting
                                                             + rq->avg_idle early bail

tbench throughput (normalized)     :      1.00                    2.47               (146.84%)

attempts                           : 6,560,413               2,273,334               (-65.35%)
shared_runq was empty              : 2,276,307 [34.70%]      1,379,071 [60.66%]      (-39.42%)
successful at pulling task         : 2,557,158 [38.98%]        342,839 [15.08%]      (-86.59%)
unsuccessful despite fetching task : 1,726,948 [26.32%]        551,424 [24.26%]      (-68.06%)

As you can see, there are more attempts and a greater chance of success
in the case without the upfront rq->avg_idle check. Where I believe the
problem lies is that a task is waiting to be enqueued / has already been
enqueued while we are busy migrating a task fetched from the
shared_runq. Thus, instead of the CPU just idling for a short duration
and then running that task, we now make it wait until we have fetched
another task onto the CPU.

I think the scenario changes as follows with shared_runq:

- Current


   [Short Idling]   [2 tasks]                [1 task]    [2 tasks]
     +-------+      +-------+                +-------+   +-------+
     |       |      |       |    wakeup      |       |   |       |
     | CPU 0 |      | CPU 1 |    on CPU0     | CPU 0 |   | CPU 1 |
     |       |      |       |   -------->    |       |   |       |
     +-------+      +-------+                +-------+   +-------+

- With shared_runq

  [pull from CPU1]  [2 tasks]                [2 tasks]   [1 task]
     +-------+      +-------+                +-------+   +-------+
     |       |      |       |    wakeup      |       |   |       |
     | CPU 0 |      | CPU 1 |    on CPU0     | CPU 0 |   | CPU 1 |
     |       |      |       |   -------->    |       |   |       |
     +-------+      +-------+                +-------+   +-------+

We reach a similar final state but with shared_runq we've paid a price
for task migration. Worst case, the following timeline can happen:

     |
CPU0 | [T0 R, T1 Q]  [   T0 R   ]  [newidle_balance]  [T4 R ...
     |
     |        pull T1 \               pull T4 /
     |
CPU1 | [T3 R]  [newidle_balance]  [T1 R, T4 Q]  [   T1 R   ]
     |         [    T4 TTWU    ]
     |

With the rq->avg_idle bailout, it might end up looking like:

     |
CPU0 | [ T0 R, T1 Q ]  [T1 R ...
     |
     |
CPU1 | [T3 R]  [ I ]  [T4 R ...
     |
     |
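
For reference, the "rq->avg_idle early bail" in the comparison above is
essentially the same kind of check newidle_balance() already does, just
applied before peeking into the shared_runq. Roughly (the exact
threshold and placement in my tree may differ):

        /*
         * Bail out of the shared_runq peek when the expected idle period
         * is too short to amortize a pull; mainline newidle_balance()
         * makes the same rq->avg_idle vs sysctl_sched_migration_cost
         * comparison.
         */
        if (rq->avg_idle < sysctl_sched_migration_cost)
                return 0;       /* behave as if the shared_runq were empty */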

If possible, can you check how long avg_idle is while running your
workload? Meanwhile, I believe there are a few other workloads that
exhibit the same behavior as tbench (large-scale idling for short
durations). Let me go check if I can see a tbench-like issue there.
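
In case it helps, one quick way to sample it per CPU is something along
these lines (assumes CONFIG_SCHED_DEBUG and debugfs mounted at
/sys/kernel/debug; older kernels expose the same data via
/proc/sched_debug):

#include <stdio.h>
#include <string.h>

/* Dump the per-CPU ".avg_idle" lines from the sched debug file. */
int main(void)
{
        FILE *f = fopen("/sys/kernel/debug/sched/debug", "r");
        char line[256];

        if (!f) {
                perror("sched debug file");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                if (strstr(line, "cpu#") || strstr(line, ".avg_idle"))
                        fputs(line, stdout);
        }

        fclose(f);
        return 0;
}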

>
>> idle. Work is conserved. What we need to worry about is tasks being in
>> the shared_runq that are queued on their respective CPU. This can only
>> happen if any one of the rq has nr_running >= 2, which is also the point
>> we are setting "shard->overload".
>
> Assuming this is about the "wakeup / enqueue to idle core" case, ok,
> this makes sense. I still think it probably makes more sense to just not
> enqueue in the shared_runq for this case though, which would allow us to
> instead just rely on list_empty(&shard->list).
>
>> Now the situation can change later and all tasks in the shared_runq might be
>> running on respective CPUs but "shard->overload" is only cleared when
>> the shared_runq becomes empty. If this is too late, maybe we can clear
>> it if periodic load balancing finds no queuing (somewhere around the
>> time we update nr_idle_scan).
>>
>> So the window where we do not go ahead with popping a task from the
>> shared_runq_shard->list is between the list being empty and at least one
>> of the CPU associated with the shard reporting nr_running >= 2, which is
>> when work conservation is needed.
>
> So, I misread your patch the first time I reviewed it, and for some
> reason thought you were only setting shard->overload on the
> load_balance(). That's obviously not the case, and I now understand it
> better, modulo my points above being clarified.
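
Just to make the intent concrete, the flag I'm experimenting with is
roughly the following (a sketch only; field and helper names are
illustrative and not the exact diff):

struct shared_runq_shard {
        struct list_head        list;
        raw_spinlock_t          lock;
        /*
         * Proposed: set when any rq covered by this shard may have
         * nr_running >= 2, i.e. a queued task that could be pulled.
         */
        int                     overload;
};

/*
 * Set at the same points as rq->rd->overload, e.g. alongside the update
 * in add_nr_running() when an rq transitions to nr_running >= 2.
 */
static inline void shard_set_overload(struct shared_runq_shard *shard)
{
        if (!READ_ONCE(shard->overload))
                WRITE_ONCE(shard->overload, 1);
}

/* Cleared when the shard's list is found to be empty. */
static inline void shard_clear_overload(struct shared_runq_shard *shard)
{
        if (READ_ONCE(shard->overload))
                WRITE_ONCE(shard->overload, 0);
}

The idea is that the newidle path can then check the flag before
bothering with the shard's lock and list.
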
>
>>>
>>>> With these changes, following are the results for tbench 128-clients:
>>>
>>> Just to make sure I understand, this is to address the contention we're
>>> observing on tbench with 64 - 256 clients, right? That's my
>>> understanding from Gautham's reply in [0].
>>>
>>> [0]: https://lore.kernel.org/all/ZOc7i7wM0x4hF4vL@xxxxxxxxxxxxxxxxxxxxxx/
>>
>> I'm not sure if Gautham saw a contention with IBS but he did see an
>> abnormal blowup in newidle_balance() counts which he suspected were the
>> cause for the regression. I noticed the rq lock contention after doing a
>> fair bit of surgery. Let me go check if that was the case with vanilla
>> v3 too.
>>
>>>
>>> If so, are we sure this change won't regress other workloads that would
>>> have benefited from the work conservation?
>>
>> I don't think we'll regress any workloads, as I explained above, because
>> the "overload" flag being 0 almost always (since the update/access is not
>> atomic) indicates a case where the tasks cannot be pulled. However, that
>> needs to be tested since there is a small behavior change in
>> shared_runq_pick_next_task(): previously, if the task was running on a
>> CPU, we would have popped it from the shared_runq and done some lock
>> fiddling before finding out it was running, plus some more lock fiddling
>> before the function returned "-1". With the changes here, it'll simply
>> return a "0", and although that is correct, we have seen some interesting
>> cases in the past [1] where random lock contention actually helps certain
>> benchmarks ¯\_(ツ)_/¯
>
> I don't think we need to worry about less lock contention possibly
> hurting benchmarks :-)

Yup :)

>
>> [1] https://lore.kernel.org/all/44428f1e-ca2c-466f-952f-d5ad33f12073@xxxxxxx/
>>
>>>
>>> Also, I assume that you don't see the improved contention without this,
>>> even if you include your fix to the newidle_balance() that has us skip
>>> over the <= LLC domain?
>>
>> No improvements! The lock is still very much contended for. I wonder if
>> it could be because of the unlocking and relocking of the rq in
>> shared_runq_pick_next_task() even when the task is on a CPU. Also, since
>> it returns -1 for this case, pick_next_task_fair() will return RETRY_TASK,
>> which can have further implications.
>
> Yeah, I could see it being an issue if we're essentially thrashing tasks
> in the shared_runq that are just temporarily enqueued in the shared_runq
> between activate and doing a resched pass on an idle core.
>
> Unfortunately, I don't think we have any choice but to drop and
> reacquire the rq lock. It's not safe to look at task_cpu(p) without
> pi_lock, and we can't safely acquire that without dropping the rq lock.

True that. We wouldn't want to run into a deadlock scenario or cause
more lock contention with double locking :(
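
For completeness, spelled out with mainline helpers the ordering we have
to respect looks roughly like the below (p and this_rq are placeholders;
the RFC's shared_runq path differs in detail):

        /*
         * p->pi_lock nests outside the rq lock, so a path that already
         * holds this_rq's lock (like newidle_balance()) must drop it
         * before it can safely inspect the task it fetched from the
         * shared_runq, and then reacquire it afterwards.
         */
        struct rq_flags rf;
        struct rq *src_rq;

        raw_spin_rq_unlock(this_rq);    /* respect pi_lock -> rq lock order */
        src_rq = task_rq_lock(p, &rf);  /* takes p->pi_lock, then p's rq lock */
        if (!task_on_cpu(src_rq, p)) {
                /* p is stable here; safe to deactivate and migrate it */
        }
        task_rq_unlock(src_rq, p, &rf);
        raw_spin_rq_lock(this_rq);      /* reacquire before returning */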

>
> Thanks,
> David

--
Thanks and Regards,
Prateek