Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

From: Tim Chen
Date: Wed Sep 27 2023 - 17:36:18 EST


On Wed, 2023-09-27 at 10:00 +0200, Ingo Molnar wrote:
> * Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
>
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases, the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> >
> > Inhibiting task migration while keeping the scheduler
> > work-conserving could benefit many workloads. Inspired by Mathieu's
> > proposal to rate-limit task migration [1], this patch considers
> > the task average sleep duration. If the task is a short sleeping one,
> > then tag its previous CPU as cache hot for a short while. During this
> > reservation period, other wakees are not allowed to pick this idle CPU
> > until a timeout. Later if the task is woken up again, it can find its
> > previous CPU still idle, and choose it in select_idle_sibling().
>
> Yeah, so I'm not convinced about this at this stage.
>
> By allowing a task to basically hog a CPU after it has gone idle already,
> however briefly, we reduce resource utilization efficiency for the sake
> of singular benchmark workloads.
>
> In a mixed environment the cost of leaving CPUs idle longer than necessary
> will show up - and none of these benchmarks show that kind of side effect
> and indirect overhead.
>
> This feature would be a lot more convincing if it tried to measure overhead
> in the pathological case, not the case it's been written for.
>

Ingo,

Mathieu's patches, which detect overly high task migration rates and
then rate-limit migrations, are one way to catch tasks that are
playing CPU musical chairs and are in a pathological state.
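
Roughly, I think of the rate limiting as something like the sketch
below. This is just a userspace illustration of the concept; the
window length, the threshold and all the names are mine, not taken
from Mathieu's actual patch:

/* Illustration only: per-task migration rate limiting. */
#include <stdbool.h>
#include <stdint.h>

struct task_stats {
	uint64_t window_start;		/* ns, start of current window */
	unsigned int nr_migrations;	/* migrations seen in the window */
};

#define WINDOW_NS	(2 * 1000 * 1000ULL)	/* 2 ms, arbitrary */
#define MIGRATION_MAX	4			/* threshold, arbitrary */

/* Return true if a wakeup migration should be allowed for @t at @now;
 * when the task has migrated too often in the current window, keep it
 * on its previous CPU instead. */
static bool migration_allowed(struct task_stats *t, uint64_t now)
{
	if (now - t->window_start > WINDOW_NS) {
		t->window_start = now;
		t->nr_migrations = 0;
	}
	if (t->nr_migrations >= MIGRATION_MAX)
		return false;	/* pathological: inhibit the migration */
	t->nr_migrations++;
	return true;
}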

Would the migration rate be a reasonable indicator that we need to do
something, like the SIS_CACHE proposal, to reduce pathological
migrations so the tasks don't get jerked all over?
Or do you have some other, better indicators in mind?
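
For reference, my reading of the SIS_CACHE mechanism quoted above is
roughly the sketch below (illustration only; the field and helper
names are mine, not from Chen Yu's patch):

/* Illustration only: reserve a short sleeper's previous CPU. */
#include <stdbool.h>
#include <stdint.h>

struct cpu_state {
	uint64_t cache_hot_until;	/* ns; 0 means not reserved */
};

/* When a short-sleeping task blocks, stamp its CPU cache hot for
 * roughly the task's average sleep duration. */
static void reserve_prev_cpu(struct cpu_state *prev, uint64_t now,
			     uint64_t avg_sleep_ns)
{
	prev->cache_hot_until = now + avg_sleep_ns;
}

/* Idle CPU scans skip a reserved CPU until the timeout expires,
 * except for the task that reserved it, which can take it back. */
static bool idle_cpu_usable(struct cpu_state *cpu, uint64_t now,
			    bool is_reserving_task)
{
	return is_reserving_task || now >= cpu->cache_hot_until;
}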

We did some experiments with an OLTP workload on a 112-core, 2-socket
SPR machine. The OLTP workload has a mixture of threads
handling database updates on disk and handling transaction
queries over the network.

For Mathieu's original task migration rate limit patches,
we saw a 1.2% improvement, and for Chen Yu's SIS_CACHE proposal, we
saw a 0.7% improvement. The system runs at ~94% busy, so it is
under high utilization. The run-to-run variation of this workload
is less than 0.2%. So there are improvements for such a mixed
workload, though not as much as for the microbenchmarks. These
data are preliminary and we are still doing more experiments.

For the OLTP experiments, each socket's 64 cores are divided into
sub-NUMA clusters of 4 nodes of 16 cores each, so the scheduling
overhead in the idle CPU search is much less than it would be with
SNC off.

Thanks.

Tim