Re: [PATCH 1/2] sched/core: Cookied forceidle accounting per cpu

From: Josh Don
Date: Tue Jan 04 2022 - 20:48:28 EST


Hi Cruz,

Could you add a bit more background to help me understand what use
case this patch addresses? Is your issue that you want to be able to
attribute forced idle time to the specific cpus it happens on, or do
you simply want a more convenient way of summing forced idle time
without iterating over your cookie'd tasks and summing their
schedstats manually?

> @@ -190,6 +202,9 @@ static int show_stat(struct seq_file *p, void *v)
> seq_put_decimal_ull(p, " ", nsec_to_clock_t(steal));
> seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest));
> seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest_nice));
> +#ifdef CONFIG_SCHED_CORE
> + seq_put_decimal_ull(p, " ", nsec_to_clock_t(cookied_forceidle));
> +#endif

IMO it would be better to always print this stat; otherwise it sets a
weird precedent for new stats added in the future (it becomes much
harder for userspace to reason about which column corresponds to which
field, since that would depend on kernel config).
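
Something like the below would keep the column layout stable
regardless of config (rough sketch, assuming cookied_forceidle is
accumulated earlier in show_stat() the way your patch does it):

	u64 cookied_forceidle = 0;

#ifdef CONFIG_SCHED_CORE
	/* accumulate per-cpu CPUTIME_COOKIED_FORCEIDLE here, as before */
#endif
	...
	/* always emit the column; it reads as 0 when !CONFIG_SCHED_CORE */
	seq_put_decimal_ull(p, " ", nsec_to_clock_t(cookied_forceidle));

That way userspace always sees the same number of fields.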

Also, did you intend to put this in /proc/stat rather than
/proc/schedstat? The latter would be more attractive, since it avoids
computing these stats unless schedstat is enabled.

> diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
> @@ -260,6 +261,21 @@ void __sched_core_account_forceidle(struct rq *rq)
>
> rq->core->core_forceidle_start = now;
>
> + for_each_cpu(i, smt_mask) {
> + rq_i = cpu_rq(i);
> + p = rq_i->core_pick ?: rq_i->curr;
> +
> + if (!rq->core->core_cookie)
> + continue;

I see this is temporary given your other patch, but just a note that
if your other patch is dropped, this check can be pulled outside the
loop.
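
i.e. roughly (untested, and assuming nothing else in the function
needs the loop to run when there's no cookie):

	/* core_cookie doesn't depend on i, so check it once up front */
	if (!rq->core->core_cookie)
		return;

	for_each_cpu(i, smt_mask) {
		rq_i = cpu_rq(i);
		p = rq_i->core_pick ?: rq_i->curr;

		if (p == rq_i->idle && rq_i->nr_running) {
			cpustat = kcpustat_cpu(i).cpustat;
			cpustat[CPUTIME_COOKIED_FORCEIDLE] += delta;
		}
	}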

> + if (p == rq_i->idle && rq_i->nr_running) {
> + cpustat = kcpustat_cpu(i).cpustat;
> + cpustat[CPUTIME_COOKIED_FORCEIDLE] += delta;
> + }
> + }

I don't think this is right. If a cpu was simply idle while some other
SMT sibling on its core was forced idle, and then a task happens to
wake on the idle cpu, that cpu will now be charged the full delta here
as forced idle (when in fact it was never forced idle; we just haven't
been through pick_next_task() yet). One workaround would be to add a
boolean to struct rq that caches whether the rq was in the forced idle
state.
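
Rough sketch of what I have in mind (the field name is just for
illustration):

	/* in struct rq, under CONFIG_SCHED_CORE */
	bool	core_forceidle;	/* this rq picked idle due to a cookie */

Set it at the point in pick_next_task() where we actually force a
sibling idle, clear it on any other pick, and then the check in the
accounting loop above becomes:

	if (rq_i->core_forceidle) {
		cpustat = kcpustat_cpu(i).cpustat;
		cpustat[CPUTIME_COOKIED_FORCEIDLE] += delta;
	}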

Best,
Josh