Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

From: K Prateek Nayak
Date: Thu Oct 05 2023 - 09:56:35 EST


Hello David,

On 10/4/2023 10:50 PM, David Vernet wrote:
> On Wed, Oct 04, 2023 at 09:51:18AM +0530, K Prateek Nayak wrote:
>> Hello David,
>
> Hello Prateek,
>
>>
>> Thank you for answering my queries, I'll leave some data below to
>> answer yours.
>>
>> On 9/29/2023 10:31 PM, David Vernet wrote:
>>> On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
>>>> Hello David,
>>>>
>>>> On 9/1/2023 12:41 AM, David Vernet wrote:
>>>>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
>>>>>
>>>>> Hi Prateek,
>>>>>
>>>>>> Even with the two patches, I still observe the following lock
>>>>>> contention when profiling the tbench 128-clients run with IBS:
>>>>>>
>>>>>>   -   12.61%  swapper  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>>>>>>      - 10.94% native_queued_spin_lock_slowpath
>>>>>>         - 10.73% _raw_spin_lock
>>>>>>            - 9.57% __schedule
>>>>>>                 schedule_idle
>>>>>>                 do_idle
>>>>>>               + cpu_startup_entry
>>>>>>            - 0.82% task_rq_lock
>>>>>>                 newidle_balance
>>>>>>                 pick_next_task_fair
>>>>>>                 __schedule
>>>>>>                 schedule_idle
>>>>>>                 do_idle
>>>>>>               + cpu_startup_entry
>>>>>>
>>>>>> Since David mentioned the rq->avg_idle check is probably not the right
>>>>>> step towards the solution, this experiment introduces a per-shard
>>>>>> "overload" flag. Similar to "rq->rd->overload", the per-shard overload
>>>>>> flag signals that one or more of the runqueues covered by the shard's
>>>>>> domain may have a queued task. The shard's overload flag is set at the
>>>>>> same time as "rq->rd->overload", and is cleared when the shard's list
>>>>>> is found to be empty.
>>>>>
>>>>> I think this is an interesting idea, but I feel that it's still working
>>>>> against the core proposition of SHARED_RUNQ, which is to enable work
>>>>> conservation.
>>>>
>>>> I don't think so! Work conservation is possible if there is an
>>>> imbalance. Consider the case where we have 15 tasks in the shared_runq
>>>> but we have 16 CPUs, 15 of which are running these 15 tasks, and one going
>>>
>>> I'm not sure I'm fully following. Those 15 tasks would not be enqueued
>>> in the shared runq if they were being run. They would be dequeued from
>>> the shared_runq in __dequeue_entity(), which would be called from
>>> set_next_entity() before they were run. In this case, the
>>> shard->overload check should be equivalent to the
>>> !list_empty(&shard->list) check.
>>>
>>> Oh, or is the idea that we're not bothering to pull them from the
>>> shared_runq if they're being woken up and enqueued on an idle core that
>>> will immediately run them on the next resched path? If so, I wonder if
>>> we would instead just want to not enqueue the task in the shared_runq at
>>> all? Consider that if another task comes in on an rq with
>>> rq->nr_running >= 2, that we still wouldn't want to pull the tasks that
>>> were being woken up on idle cores (nor take the overhead of inserting
>>> and then immediately removing them from the shared_runq).
>
> Friendly ping on this point. This is the only scenario where I could see
> the overload check helping, so I want to make sure I'm understanding it
> and am correct in that just avoiding enqueueing the task in the shard in
> this scenario would give us the same benefit.

Woops! Missed answering this. So the original motivation for
'shard->overload' was the rq lock contention, very likely a result of
shared_runq_pick_next_task() trying to grab a remote rq's lock. Looking
at shared_runq_pick_next_task(), the "!task_on_cpu(src_rq, p)" criterion
led me to believe we might end up enqueuing a task that is already
running on a CPU, but now that I take a look at
shared_runq_enqueue_task() being called from __enqueue_entity(), that
should be a very rare scenario.
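
To make the contention point concrete, the peek looks roughly like the
below. This is a paraphrased sketch from memory, not the exact hunk from
the series (the shard lock, the "shared_runq_node" list field and the
helper name are illustrative):

static struct task_struct *shared_runq_peek(struct shared_runq_shard *shard,
                                            struct rq *dst_rq)
{
        struct task_struct *p;
        struct rq *src_rq;

        /* Pop the first task queued on the shard, if any. */
        raw_spin_lock(&shard->lock);
        p = list_first_entry_or_null(&shard->list, struct task_struct,
                                     shared_runq_node);
        if (p)
                list_del_init(&p->shared_runq_node);
        raw_spin_unlock(&shard->lock);

        if (!p)
                return NULL;

        /* The remote rq lock here is the contention seen in the profile. */
        src_rq = task_rq(p);
        raw_spin_rq_lock(src_rq);
        if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
                update_rq_clock(src_rq);
                deactivate_task(src_rq, p, DEQUEUE_NOCLOCK);
                set_task_cpu(p, cpu_of(dst_rq));
        } else {
                p = NULL;
        }
        raw_spin_rq_unlock(src_rq);

        /* Caller then activates the returned task on dst_rq. */
        return p;
}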

However, even if a running task were to end up in the shared_runq every
now and then, the 'shard->overload' flag, when clear, indicates that
none of the runqueues covered by the shard are overloaded and hence,
peeking into the shard can be skipped. Let me see if I can grab some
more stats to verify what exactly is happening.
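
Just to spell out what I had in mind with the flag (the field and helper
names below are illustrative, not necessarily what the RFC uses):

struct shared_runq_shard {
        struct list_head        list;
        raw_spinlock_t          lock;
        /*
         * Set alongside rq->rd->overload when a covered rq gains a
         * second runnable task; cleared when the list is found empty.
         */
        int                     overload;
};

/*
 * Cheap check so newidle_balance() can skip the peek (and the locking
 * that comes with it) when none of the covered runqueues are overloaded.
 */
static inline bool shared_runq_worth_peeking(struct shared_runq_shard *shard)
{
        return READ_ONCE(shard->overload);
}

The flag is racy by design; a stale read only costs one unnecessary (or
one missed) peek, much like rq->rd->overload itself.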

>
>> So this is the breakdown of outcomes after peeking into the shared_runq
>> during newidle_balance:
>>
>>                                             SHARED_RUNQ              SHARED_RUNQ
>>                                     + correct cost accounting  + correct cost accounting
>>                                                                + rq->avg_idle early bail
>>
>> tbench throughput (normalized)     :            1.00                      2.47  (146.84%)
>>
>> attempts                           :       6,560,413                 2,273,334  (-65.35%)
>> shared_runq was empty              :       2,276,307 [34.70%]        1,379,071  [60.66%] (-39.42%)
>> successful at pulling task         :       2,557,158 [38.98%]          342,839  [15.08%] (-86.59%)
>> unsuccessful despite fetching task :       1,726,948 [26.32%]          551,424  [24.26%] (-68.06%)
>>
>> As you can see, there are more attempts and a greater chance of success
>> in the case without the rq->avg_idle check upfront. Where I believe the
>> problem lies is that a task is waiting to be enqueued / has been
>> enqueued while we are trying to migrate a task fetched from the
>> shared_runq. Thus, instead of the CPU being idle for only a short
>> duration and then running the waking task, we now make that task wait
>> until we have fetched and migrated another task onto the CPU.
>>
>> I think the scenario changes as follows with shared_runq:
>>
>> - Current
>>
>>
>>  [Short Idling]    [2 tasks]              [1 task]      [2 tasks]
>>     +-------+      +-------+              +-------+     +-------+
>>     |       |      |       |   wakeup     |       |     |       |
>>     | CPU 0 |      | CPU 1 |  on CPU0     | CPU 0 |     | CPU 1 |
>>     |       |      |       | --------->   |       |     |       |
>>     +-------+      +-------+              +-------+     +-------+
>>
>> - With shared_runq
>>
>> [pull from CPU1]   [2 tasks]              [2 tasks]     [1 task]
>>     +-------+      +-------+              +-------+     +-------+
>>     |       |      |       |   wakeup     |       |     |       |
>>     | CPU 0 |      | CPU 1 |  on CPU0     | CPU 0 |     | CPU 1 |
>>     |       |      |       | --------->   |       |     |       |
>>     +-------+      +-------+              +-------+     +-------+
>>
>> We reach a similar final state but with shared_runq we've paid a price
>> for task migration. Worst case, the following timeline can happen:
>>
>>       |
>> CPU0  | [ T0 R, T1 Q ] [     T0 R     ] [newidle_balance] [T4 R ...
>>       |
>>       |      pull T1 \              pull T4 /
>>       |
>> CPU1  | [T3 R] [newidle_balance] [ T1 R, T4 Q ] [      T1 R      ]
>>       |                            [T4 TTWU]
>>       |
>>
>> With the rq->avg_idle bailout, it might end up looking like:
>>
>>       |
>> CPU0  | [ T0 R, T1 Q ] [T1 R ...
>>       |
>>       |
>> CPU1  | [T3 R] [ I ] [T4 R ...
>>       |
>>       |
>
> This certainly seems possible, and wouldn't be terribly surprising or
> unexpected. Taking a step back here, I want to be clear that I do
> understand the motivation for including the rq->avg_idle check for
> SHARED_RUNQ; even just conceptually, and regardless of the numbers you
> and others have observed for workloads that do these short sleeps. The
> whole idea behind that check is that we want to avoid doing
> newidle_balance() if the overhead of doing newidle_balance() would
> exceed the amount of time that a task was blocked. Makes sense. Why
> would you take the overhead of balancing if you have reason to believe
> that a task is likely to be idle for less time than it takes to do a
> migration?
>
> There's certainly a reasonable argument for why that should also apply
> to SHARED_RUNQ. If the overhead of doing a SHARED_RUNQ migration is
> greater than the amount of time that an sd is expected to be idle, then
> it's not worth bothering with SHARED_RUNQ either. On the other hand, the
> claim of SHARED_RUNQ is that it's faster than doing a regular balance
> pass, because we're doing an O(# shards) iteration to find tasks (before
> sharding it was O(1)), rather than O(# CPUs). So if we also do the
> rq->avg_idle check, that basically means that SHARED_RUNQ becomes a
> cache for a full load_balance() call.
>
> Maybe that makes sense and is ultimately the correct design /
> implementation for the feature. I'm not fundamentally opposed to that,
> but I think we should be cognizant of the tradeoff we're making. If we
> don't include this rq->avg_idle check, then some workloads will regress
> because we're doing excessive migrations, but if we do check it, then
> others will also regress because we're doing insufficient migrations due
> to incorrectly assuming that an rq won't be idle for long. On yet
> another hand, maybe it's fine to allow users to work around that by
> setting sysctl_sched_migration_cost_ns = 0? That only sort of works,
> because we ignore that and set rq->max_idle_balance_cost = curr_cost in
> newidle_balance() if we end up doing a balance pass. I also know that
> Peter and others discourage the use of these debugfs knobs, so I'm not
> sure it's even applicable to point that out as a workaround.
>
> And so hopefully the problem starts to become clear. It doesn't take
> long for us to get mired in heuristics that make it difficult to
> reason about the expected behavior of the feature, and also difficult to
> reason about future changes as these heuristics have now all crossed
> streams. Maybe that's OK, and is preferable to the alternative. My
> personal opinion, however, is that it's preferable to provide users with
> knobs that do straightforward things that are independent from existing
> heuristics and knobs which were added for other circumstances. I'd
> rather have confidence that I understand how a feature is supposed to
> work, and can easily reason about when it's stupid (or not) to use it,
> vs. have an expectation for it to not regress workloads in any scenario.
>
> Note that this doesn't mean we can't make my patches less dumb. I think
> your suggestions to e.g. check the overload flag (or possibly even
> better to just not enqueue in a shard if the rq isn't overloaded),
> re-check ttwu->pending after failing to find a task in the shard, etc
> make complete sense. There's no downside -- we're just avoiding
> pointless work. It's the heuristics like checking rq->avg_idle that
> really worry me.

I agree, since avg_idle is merely a prediction that may or may not hold
true.
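
For reference, the gate we are debating is essentially the existing
early bail at the top of newidle_balance(), roughly (paraphrasing the
upstream check rather than quoting the exact hunk):

        if (this_rq->avg_idle < sysctl_sched_migration_cost ||
            !READ_ONCE(this_rq->rd->overload)) {
                /*
                 * Expected idle time is shorter than the cost of a
                 * migration, so don't bother balancing. The question is
                 * whether the shared_runq peek belongs above or below
                 * this bail-out.
                 */
                if (sd)
                        update_next_balance(sd, &next_balance);
                rcu_read_unlock();
                goto out;
        }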

>
> Peter -- I think it would be helpful if you could weigh in here just to
> provide your thoughts on this more "philosophical" question.
>
>> If possible, can you check how long avg_idle is when running your
>> workload? Meanwhile, I believe there are a few workloads that exhibit
>> the same behavior as tbench (large-scale idling for short durations).
>> Let me go check if I can see a tbench-like issue there.
>
> Sure thing, in the meantime I'll test this out on HHVM. I've actually
> been working on getting a build + testbed ready for a few days, so
> hopefully it won't take much longer to get some results. Even if it
> turns out that this works great for HHVM, I'd ideally like to get
> Peter's and others' thoughts on the above.

I'll gather some more data too in the meantime :)

>
> Thanks,
> David

--
Thanks and Regards,
Prateek