Re: [PATCH] sched/fair: Again ignore percpu threads for imbalance pulls

From: Valentin Schneider
Date: Thu Dec 16 2021 - 13:26:47 EST


On 11/12/21 17:48, Yihao Wu wrote:
> commit 2f5f4cce496e ("sched/fair: Ignore percpu threads for imbalance
> pulls") was meant to fix a performance issue where load balancing tries to
> migrate pinned kernel threads at the MC domain level, which is destined to
> fail.

> After it fails, it further messes up wakeup balance at the NUMA domain
> level. The most severe case I noticed, and one that occurs frequently, is:
> |sum_nr_running(node1) - sum_nr_running(node2)| > 100
>

Wakeup balance (aka find_idlest_cpu()) is different from periodic load
balance (aka load_balance()) and doesn't use can_migrate_task(), so the
incriminated commit shouldn't have impacted it (at least not in obvious
ways...). Do you have any more details on that issue?

> However, the original bugfix falls short because it covers only case 1)
> below.
> 1) Created by create_kthread()
> 2) Created by kernel_thread()
> No struct kthread is assigned to the task_struct in case 2) (see the
> comments in free_kthread_struct()), so it simply doesn't work there.
>
> The easiest way to cover both cases is to check nr_cpus_allowed, as
> discussed on the mailing list for v1 of the original fix.
>
> * lmbench3.lat_proc -P 104 fork (2 NUMA nodes, 26 cores per node, 2 threads per core)
>

Reasoning about "proper" pcpu kthreads was simpler since they are static;
see 3a7956e25e1d ("kthread: Fix PF_KTHREAD vs to_kthread() race").

>                      w/out patch    w/ patch
> fork+exit latency    1660 ms        1520 ms ( 8.4%)
>
> Fixes: 2f5f4cce496e ("sched/fair: Ignore percpu threads for imbalance pulls")
> Signed-off-by: Yihao Wu <wuyihao@xxxxxxxxxxxxxxxxx>
> ---
> kernel/kthread.c | 6 +-----
> 1 file changed, 1 insertion(+), 5 deletions(-)
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 4a4d7092a2d8..cb05d3ff2de4 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -543,11 +543,7 @@ void kthread_set_per_cpu(struct task_struct *k, int cpu)
>
> bool kthread_is_per_cpu(struct task_struct *p)
> {
> -	struct kthread *kthread = __to_kthread(p);
> -	if (!kthread)
> -		return false;
> -
> -	return test_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
> +	return (p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1;
> }

As Peter said, this is going to cause issues. If you look at
kthread_set_per_cpu(), we also store a CPU value which we expect to be
valid whenever kthread_is_per_cpu() returns true, and that change breaks
this.
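To spell it out, kthread_set_per_cpu() does roughly the following
(paraphrased from memory of mainline, so double-check against your tree):

  void kthread_set_per_cpu(struct task_struct *k, int cpu)
  {
          struct kthread *kthread = to_kthread(k);

          if (!kthread)
                  return;

          WARN_ON_ONCE(!(k->flags & PF_NO_SETAFFINITY));

          if (cpu < 0) {
                  clear_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
                  return;
          }

          /* The CPU and the flag are always set as a pair */
          kthread->cpu = cpu;
          set_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
  }

i.e. kthread->cpu and KTHREAD_IS_PER_CPU are only ever set together, and
that pairing is what we rely on when kthread_is_per_cpu() says yes. Making
kthread_is_per_cpu() return true for tasks that never went through
kthread_set_per_cpu() silently drops that guarantee.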

AIUI what you want to patch is the actual usage in can_migrate_task().
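I.e. keep kthread_is_per_cpu() as it is and widen the check on the
load-balance side instead, something along these lines (completely
untested, just a sketch of the direction, with your nr_cpus_allowed check
moved over):

  /* in can_migrate_task() */
  -	if (kthread_is_per_cpu(p))
  +	if (kthread_is_per_cpu(p) ||
  +	    ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1))
  		return 0;

That would keep the KTHREAD_IS_PER_CPU / kthread->cpu pairing intact while
still covering kthreads spawned via kernel_thread().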

>
> /**
> --
> 2.32.0.604.gb1f3e1269