Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

From: Aaron Lu
Date: Wed Apr 12 2023 - 10:02:32 EST


On Wed, Apr 12, 2023 at 01:59:36PM +0200, Peter Zijlstra wrote:
> On Mon, Mar 27, 2023 at 01:39:55PM +0800, Aaron Lu wrote:
> > When using sysbench to benchmark Postgres in a single docker instance
> > with sysbench's nr_threads set to nr_cpu, it is observed that at times
> > update_cfs_group() and update_load_avg() show noticeable overhead on
> > cpus of one node of a 2sockets/112core/224cpu Intel Sapphire Rapids:
> >
> > 10.01% 9.86% [kernel.vmlinux] [k] update_cfs_group
> > 7.84% 7.43% [kernel.vmlinux] [k] update_load_avg
> >
> > While cpus of the other node normally see a lower cycle percent:
> >
> > 4.46% 4.36% [kernel.vmlinux] [k] update_cfs_group
> > 4.02% 3.40% [kernel.vmlinux] [k] update_load_avg
> >
> > Annotate shows the cycles are mostly spent on accessing tg->load_avg
> > with update_load_avg() being the write side and update_cfs_group() being
> > the read side.
> >
> > The reason why only cpus of one node have bigger overhead is: task_group
> > is allocated on demand from a slab and whichever cpu happens to do the
> > allocation, the allocated tg will be located on that node, so accessing
> > tg->load_avg will have a lower cost for cpus on the same node and
> > a higher cost for cpus of the remote node.
> >
> > Tim Chen told me that PeterZ once mentioned a way to solve a similar
> > problem by making a counter per node so do the same for tg->load_avg.
>
> Yeah, I sent him a very similar patch (except horrible) some 5 years ago
> for testing.
>
> > After this change, the worst numbers I saw during a 5-minute run on
> > both nodes are:
> >
> > 2.77% 2.11% [kernel.vmlinux] [k] update_load_avg
> > 2.72% 2.59% [kernel.vmlinux] [k] update_cfs_group
>
> Nice!

:-)

> > Another observation of this workload is: it has a lot of wakeup time
> > task migrations and that is the reason why update_load_avg() and
> > update_cfs_group() show noticeable cost. Running this workload in an
> > N-instance setup where N >= 2 with sysbench's nr_threads set to 1/N nr_cpu,
> > task migrations at wakeup time are greatly reduced and the overhead from
> > the two above-mentioned functions also drops a lot. It's not yet clear to
> > me why running in multiple instances can reduce task migrations on the
> > wakeup path.
>
> If there is *any* idle time, we're rather aggressive at moving tasks to
> idle CPUs in an attempt to avoid said idle time. If you're running at
> about the number of CPUs there will be a fair amount of idle time and
> hence significant migrations.

Yes indeed.

> When you overload, there will no longer be idle time and hence no more
> migrations.

True. My later profile showed that the multi-instance case has much lower
idle time than the single-instance setup (0.4%-2% vs ~20%) and thus far
fewer migrations on wakeup: thousands vs millions in a 5s window.

> > Reported-by: Nitin Tekchandani <nitin.tekchandani@xxxxxxxxx>
> > Signed-off-by: Aaron Lu <aaron.lu@xxxxxxxxx>
>
> If you want to make things more complicated you can check
> num_possible_nodes()==1 on boot and then avoid the indirection, but

Ah right, will think about how to achieve this.
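
To make the idea concrete, here is a rough user-space sketch of a per-node
counter with a single-node shortcut on the read side. This is not the patch
and not kernel code: struct tg_sketch, MAX_NODES, CACHELINE and the tg_*()
helpers are all made-up names for illustration, C11 atomics stand in for
atomic_long_t, and node-local allocation is not modelled at all. The
nr_nodes == 1 check is only a guess at what such a fast path might look like.

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_NODES 8   /* hypothetical upper bound, illustration only */
#define CACHELINE 64  /* assumed cache line size */

/* One counter per node, each padded to its own cache line so that
 * writers on different nodes do not share a line. */
struct node_counter {
	_Alignas(CACHELINE) atomic_long load_avg;
};

/* Hypothetical stand-in for the per-node part of a task group. */
struct tg_sketch {
	int nr_nodes;
	struct node_counter node[MAX_NODES];
};

static void tg_init(struct tg_sketch *tg, int nr_nodes)
{
	assert(nr_nodes >= 1 && nr_nodes <= MAX_NODES);
	tg->nr_nodes = nr_nodes;
	for (int i = 0; i < MAX_NODES; i++)
		atomic_init(&tg->node[i].load_avg, 0);
}

/* Write side: a cpu only touches the counter of its own node. */
static void tg_add_load(struct tg_sketch *tg, int node, long delta)
{
	assert(node >= 0 && node < tg->nr_nodes);
	atomic_fetch_add_explicit(&tg->node[node].load_avg, delta,
				  memory_order_relaxed);
}

/* Read side: sum the per-node counters; with a single node there is
 * nothing to sum, so read the one counter directly. */
static long tg_read_load(struct tg_sketch *tg)
{
	long sum = 0;

	if (tg->nr_nodes == 1)
		return atomic_load_explicit(&tg->node[0].load_avg,
					    memory_order_relaxed);

	for (int i = 0; i < tg->nr_nodes; i++)
		sum += atomic_load_explicit(&tg->node[i].load_avg,
					    memory_order_relaxed);
	return sum;
}
```

In this sketch the write side stays node-local, at the price of a read side
that has to visit one cache line per node.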

Thanks for your comments.