Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

From: Peter Zijlstra
Date: Mon Jun 05 2023 - 13:56:51 EST


On Mon, Jun 05, 2023 at 05:25:30PM +0200, Marek Szyprowski wrote:
> On 31.05.2023 14:04, tip-bot2 for Peter Zijlstra wrote:
> > The following commit has been merged into the sched/core branch of tip:
> >
> > Commit-ID: c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> > Gitweb: https://git.kernel.org/tip/c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> > Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > AuthorDate: Tue, 30 May 2023 13:20:46 +02:00
> > Committer: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > CommitterDate: Tue, 30 May 2023 22:46:27 +02:00
> >
> > sched/fair: Multi-LLC select_idle_sibling()
> >
> > Tejun reported that when he targets workqueues towards a specific LLC
> > on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
> > significant idle time.
> >
> > This is, of course, because of how select_idle_sibling() will not
> > consider anything outside of the local LLC, and since all these tasks
> > are short running the periodic idle load balancer is ineffective.
> >
> > And while it is good to keep work cache local, it is better to not
> > have significant idle time. Therefore, have select_idle_sibling() try
> > other LLCs inside the same node when the local one comes up empty.
> >
> > Reported-by: Tejun Heo <tj@xxxxxxxxxx>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
>
> This patch landed in today's linux next-20230605 as commit c5214e13ad60
> ("sched/fair: Multi-LLC select_idle_sibling()"). Unfortunately it causes
> regression on my ARM 64bit Exynos5433-based TM2e test board during the
> CPU hotplug tests. From time to time I get the NULL pointer dereference.
> Reverting $subject on top of linux-next fixes the issue. Let me know if
> I can help somehow debugging this issue. Here is a complete log (I've
> intentionally kept all the stack dumps, although they don't look very
> relevant...):

Moo... OK, since our friends from AMD need some tuning on this anyway,
i'm going to pull the patch entirely. And we'll try again once they've
sorted out the best way to do this.