Re: [PATCH v3 6/7] sched: Implement shared runqueue in CFS

From: K Prateek Nayak
Date: Wed Aug 30 2023 - 23:48:18 EST


Hello David,

On 8/31/2023 7:04 AM, David Vernet wrote:
> On Wed, Aug 30, 2023 at 12:16:17PM +0530, K Prateek Nayak wrote:
>> Hello David,
>
> Hello Prateek,
>
>>
>> On 8/10/2023 3:42 AM, David Vernet wrote:
>>> [..snip..]
>>> +        if (p && is_cpu_allowed(p, cpu_of(rq)))
>>> +                list_del_init(&p->shared_runq_node);
>>
>> I wonder if we should remove the task from the list even if
>> "is_cpu_allowed()" returns false.
>>
>> Consider the following scenario: a task that does not sleep is pinned
>> to a single CPU. Since it is now at the head of the list and cannot be
>> moved, we leave it there; and since it also never sleeps, it will stay
>> there, preventing the queue from doing its job.
>
> Hmm, sorry, I may not be understanding your suggestion. If a task was
> pinned to a single CPU, it would be dequeued from the shared_runq before
> being pinned (see __set_cpus_allowed_ptr_locked()), and then would not
> be added back to the shard in shared_runq_enqueue_task() because of
> p->nr_cpus_allowed == 1. The task would also be dequeued from the shard
> before it started running (unless I'm misunderstanding what you mean by
> "a task that does not sleep"). Please let me know if I'm missing
> something.

Ah! My bad. Completely missed that. Thank you for clarifying.
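
Just to write down the flow I had missed, paraphrasing your description
(a rough sketch of my understanding; the core.c hunk is not quoted in
this mail, so the exact placement of the hook is my assumption):

        __set_cpus_allowed_ptr_locked()
                -> shared_runq_dequeue_task(p)  /* task leaves the shard    */
                -> new cpumask takes effect     /* nr_cpus_allowed may be 1 */

        shared_runq_enqueue_task()
                -> if (p->nr_cpus_allowed == 1) /* pinned task never goes   */
                        return;                 /* back on the shard        */

So a pinned, never-sleeping task should never be sitting at the head of
the shard in the first place.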

>
>> Further implication ...
>>
>>> +        else
>>> +                p = NULL;
>>> +        raw_spin_unlock(&shared_runq->lock);
>>> +
>>> +        return p;
>>> +}
>>> +
>>> +static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
>>> +{
>>> +        struct shared_runq *shared_runq;
>>> +
>>> +        shared_runq = rq_shared_runq(rq);
>>> +        raw_spin_lock(&shared_runq->lock);
>>> +        list_add_tail(&p->shared_runq_node, &shared_runq->list);
>>> +        raw_spin_unlock(&shared_runq->lock);
>>> +}
>>>
>>> static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
>>> -{}
>>> +{
>>> +        /*
>>> +         * Only enqueue the task in the shared runqueue if:
>>> +         *
>>> +         * - SHARED_RUNQ is enabled
>>> +         * - The task isn't pinned to a specific CPU
>>> +         */
>>> +        if (p->nr_cpus_allowed == 1)
>>> +                return;
>>> +
>>> +        shared_runq_push_task(rq, p);
>>> +}
>>>
>>> static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>>> {
>>> -        return 0;
>>> +        struct task_struct *p = NULL;
>>> +        struct rq *src_rq;
>>> +        struct rq_flags src_rf;
>>> +        int ret = -1;
>>> +
>>> +        p = shared_runq_pop_task(rq);
>>> +        if (!p)
>>> +                return 0;
>>
>> ...
>>
>> Since we return 0 here in such a scenario, we'll take the old
>> newidle_balance() path but ...
>>
>>> +
>>> +        rq_unpin_lock(rq, rf);
>>> +        raw_spin_rq_unlock(rq);
>>> +
>>> +        src_rq = task_rq_lock(p, &src_rf);
>>> +
>>> +        if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
>>> +                update_rq_clock(src_rq);
>>> +                src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
>>> +                ret = 1;
>>> +        }
>>> +
>>> +        if (src_rq != rq) {
>>> +                task_rq_unlock(src_rq, p, &src_rf);
>>> +                raw_spin_rq_lock(rq);
>>> +        } else {
>>> +                rq_unpin_lock(rq, &src_rf);
>>> +                raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>>> +        }
>>> +        rq_repin_lock(rq, rf);
>>> +
>>> +        return ret;
>>> }
>>>
>>> static void shared_runq_dequeue_task(struct task_struct *p)
>>> -{}
>>> +{
>>> +        struct shared_runq *shared_runq;
>>> +
>>> +        if (!list_empty(&p->shared_runq_node)) {
>>> +                shared_runq = rq_shared_runq(task_rq(p));
>>> +                raw_spin_lock(&shared_runq->lock);
>>> +                /*
>>> +                 * Need to double-check for the list being empty to avoid
>>> +                 * racing with the list being drained on the domain recreation
>>> +                 * or SHARED_RUNQ feature enable / disable path.
>>> +                 */
>>> +                if (likely(!list_empty(&p->shared_runq_node)))
>>> +                        list_del_init(&p->shared_runq_node);
>>> +                raw_spin_unlock(&shared_runq->lock);
>>> +        }
>>> +}
>>>
>>> /*
>>> * For asym packing, by default the lower numbered CPU has higher priority.
>>> @@ -12093,6 +12308,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>> rcu_read_lock();
>>> sd = rcu_dereference_check_sched_domain(this_rq->sd);
>>>
>>> +        /*
>>> +         * Skip <= LLC domains as they likely won't have any tasks if the
>>> +         * shared runq is empty.
>>> +         */
>>
>> ... now we skip all the way past the MC domain, overlooking any
>> imbalance that might still exist within the SMT and MC groups,
>> since the shared runq is not exactly empty.
>>
>> Let me know if I've got something wrong!
>
> Yep, I mentioned this to Gautham as well in [0].
>
> [0]: https://lore.kernel.org/all/20230818050355.GA5718@maniforge/
>
> I agree; I think we should remove this heuristic from v4. Either
> that, or add logic to iterate over the shared_runq until a viable task
> is found.
>
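
If we go the iteration route, something along these lines on the pop
side might do it (an untested sketch built only on the hunks quoted
above, reusing your is_cpu_allowed() check):

        static struct task_struct *shared_runq_pop_task(struct rq *rq)
        {
                struct task_struct *p, *tmp;
                struct shared_runq *shared_runq = rq_shared_runq(rq);

                raw_spin_lock(&shared_runq->lock);
                /* Walk the shard, skipping entries this CPU cannot run. */
                list_for_each_entry_safe(p, tmp, &shared_runq->list,
                                         shared_runq_node) {
                        if (is_cpu_allowed(p, cpu_of(rq))) {
                                list_del_init(&p->shared_runq_node);
                                raw_spin_unlock(&shared_runq->lock);
                                return p;
                        }
                }
                raw_spin_unlock(&shared_runq->lock);

                return NULL;
        }

The downside is that a shard full of unmovable tasks would be walked
under the lock on every pop, so the walk would likely need to be
bounded.
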
>>
>>> +        if (sched_feat(SHARED_RUNQ)) {
>>> +                sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>> +                if (likely(sd))
>>> +                        sd = sd->parent;
>>> +        }
>>
>> Speaking of skipping ahead of the MC domain, I don't think this
>> actually works, since the domain traversal uses the "for_each_domain"
>> macro, which is defined as:
>
> *blinks*
>
> Uhhh, yeah, wow. Good catch!
>
>> #define for_each_domain(cpu, __sd) \
>>         for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
>>                         __sd; __sd = __sd->parent)
>>
>> The traversal starts from rq->sd, overwriting the value you
>> initialized here. This is why the "load_balance count on cpu newly
>> idle" in Gautham's first report
>> (https://lore.kernel.org/lkml/ZN3dW5Gvcb0LFWjs@xxxxxxxxxxxxxxxxxxxxxx/)
>> is non-zero.
>>
>> One way to do this would be as follows:
>>
>> static int newidle_balance() {
>>
>>         ...
>>         for_each_domain(this_cpu, sd) {
>>
>>                 ...
>>                 /* Skip balancing until LLC domain */
>>                 if (sched_feat(SHARED_RUNQ) &&
>>                     (sd->flags & SD_SHARE_PKG_RESOURCES))
>>                         continue;
>>
>>                 ...
>>         }
>>         ...
>> }
>
> Yep, I think this makes sense to do.
>
>> With this, I see the newidle balance counts for the SMT and MC domains
>> drop to zero:
>
> And indeed, I think this was the intention. Thanks again for catching
> this. I'm excited to try this out when running benchmarks for v4.
>
>> < ---------------------------------------- Category: newidle (SMT) ---------------------------------------- >
>> load_balance cnt on cpu newly idle : 0 $ 0.000 $ [ 0.00000 ]
>> --
>> < ---------------------------------------- Category: newidle (MC) ---------------------------------------- >
>> load_balance cnt on cpu newly idle : 0 $ 0.000 $ [ 0.00000 ]
>> --
>> < ---------------------------------------- Category: newidle (DIE) ---------------------------------------- >
>> load_balance cnt on cpu newly idle : 2170 $ 9.319 $ [ 17.42832 ]
>> --
>> < ---------------------------------------- Category: newidle (NUMA) ---------------------------------------- >
>> load_balance cnt on cpu newly idle : 30 $ 674.067 $ [ 0.24094 ]
>> --
>>
>> Let me know if I'm missing something here :)
>
> No, I think you're correct, we should be doing this. Assuming we want to
> keep this heuristic, I think the block above is also correct so that we
> properly account sd->max_newidle_lb_cost and rq->next_balance. Does that
> make sense to you too?

Yup, that makes sense!
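
To make sure we are thinking of the same placement: the skip would sit
after the bookkeeping at the top of the loop, roughly like below (an
untested sketch against my reading of the current newidle_balance();
helper names taken from the existing code):

        for_each_domain(this_cpu, sd) {
                ...
                /* Keep rq->next_balance accounted for skipped domains. */
                update_next_balance(sd, &next_balance);

                if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
                        break;

                /* Leave balancing at and below the LLC to the shared runq. */
                if (sched_feat(SHARED_RUNQ) &&
                    (sd->flags & SD_SHARE_PKG_RESOURCES))
                        continue;
                ...
        }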

>
>>
>> Note: The lb counts for DIE and NUMA are down since I'm currently
>> experimenting with the implementation. I'll update the thread with any
>> new findings.
>
> Ack, thank you for doing that.
>
> Just FYI, I'll be on vacation for about 1.5 weeks starting tomorrow
> afternoon. If I'm slow to respond, that's why.

Enjoy your vacation :)

I'll keep experimenting in the meantime and will update the report in
the parallel thread.

>
> Thanks,
> David

--
Thanks and Regards,
Prateek