Re: [External] Re: [PATCH v4 6/7] sched/fair: skip busy cores in SIS search

From: Abel Wu
Date: Tue Jun 21 2022 - 23:52:30 EST


Hi Chen, thanks for your comments!

On 6/22/22 2:14 AM, Chen Yu Wrote:
...
Reuse the data from load balance to select the unoccupied candidate
is applicable IMO, which is also aligned with SIS_UTIL path. And I have
a question regarding the update frequency. In v3 patch, the update is
based on periodic tick, which is triggered at most every 1ms(CONFIG_HZ_1000).
Then the periodic SMT load balance is launched every smt_weight ms, usually 2ms.
I expect the 2ms is of the same scale unit as 1ms, and since task tick based
update is acceptable, would excluding the CPU_NEWLY_IDLE balance from this patch
reduce the overhead meanwhile not bring too much inaccuracy?

I doubt periodic balancing entry is enough. The ms-scale could still
be too large for wakeup path. Say one cpu becomes newly idle just after
an update, then it keeps untouchable until next update which is nearly
2ms (even worse in SMT4/8 case) wasting time-slices to do nothing. So
newly-idle update is important to keep the filter fresh. And the false
positive correction is there to avoid excessive updates due to newly
idle, by allowing the false positive cpus to stay in the filter for a
little longer.

@@ -8757,7 +8794,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
* No need to call idle_cpu() if nr_running is not 0
*/
if (!nr_running && idle_cpu(i)) {
+ /*
+ * Prefer the last idle cpu by overwriting
+ * preious one. The first idle cpu in this
+ * domain (if any) can trigger balancing
+ * and fed with tasks, so we'd better choose
+ * a candidate in an opposite way.
+ */
+ sds->idle_cpu = i;
Does it mean, only 1 idle CPU in the smt domain would be set to the
idle cpu mask at one time? For SMT4/8 we might lose track of the
idle siblings.

Yes. The intention of one-at-a-time propagation is
1) help spreading out load to different cores
2) reduce some update overhead
In this way, if the filter contains several cpus of a core, ideally
we can sure about that at least one of them is actually unoccupied.
For SMT4/8 we still have newly idle balance to make things right.

sgs->idle_cpus++;
+
/* Idle cpu can't have misfit task */
continue;
}
@@ -9273,8 +9319,40 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
{
- if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
- set_idle_cpus(env->dst_cpu, true);
+ struct sched_domain_shared *sd_smt_shared = env->sd->shared;
+ enum sd_state new = sds->sd_state;
+ int this = env->dst_cpu;
+
+ /*
+ * Parallel updating can hardly contribute accuracy to
+ * the filter, besides it can be one of the burdens on
+ * cache traffic.
+ */
+ if (cmpxchg(&sd_smt_shared->updating, 0, 1))
+ return;
+
+ /*
+ * There is at least one unoccupied cpu available, so
+ * propagate it to the filter to avoid false negative
+ * issue which could result in lost tracking of some
+ * idle cpus thus throughupt downgraded.
+ */
+ if (new != sd_is_busy) {
+ if (!test_idle_cpus(this))
+ set_idle_cpus(this, true);
+ } else {
+ /*
+ * Nothing changes so nothing to update or
+ * propagate.
+ */
+ if (sd_smt_shared->state == sd_is_busy)
+ goto out;
+ }
+
+ sd_update_icpus(this, sds->idle_cpu);
I wonder if we could further enhance it to facilitate idle CPU scan.
For example, can we propagate the idle CPUs in smt domain, to its parent
domain in a hierarchic sequence, and finally to the LLC domain. If there is

In fact, it was my first try to cache the unoccupied cpus in SMT
shared domain, but the overhead of cpumask ops seems like a major
stumbling block.

a cluster domain between SMT and LLC domain, the cluster domain idle CPU filter
could benefit from this mechanism.
https://lore.kernel.org/lkml/20220609120622.47724-3-yangyicong@xxxxxxxxxxxxx/

Putting SIS into a hierarchical pattern is good for cache locality.
But I don't think multi-level filter is appropriate since it could
bring too much cache traffic in SIS, and it could be expected to be
a disaster for netperf/tbench or the workloads suffering frequent
context switches.


Furthermore, even if there is no cluster domain, would a 'virtual' middle
sched domain would help reduce the contention?
Core0(CPU0,CPU1),Core1(CPU2,CPU3) Core2(CPU4,CPU5) Core3(CPU6,CPU7)
We can create cpumask1, which is composed of Core0 and Core1, and cpumask2
which is composed of Core2 and Core3. The SIS would first scan in cpumask1,
if idle cpu is not found, scan cpumask2. In this way, the CPUs in Core0 and
Core1 only updates cpumask1, without competing with Core2 and Core3 on cpumask2.

Yes, this is the best case, but the worst case is something that
we probably can't afford.

Thanks & BR,
Abel