Re: [PATCH v3 2/2] sched/fair: Choose the CPU where short task is running during wake up

From: Chen Yu
Date: Mon Dec 05 2022 - 03:40:54 EST


On 2022-12-05 at 10:36:25 +0800, Tianchen Ding wrote:
> Hi Chen.
>
> On 2022/12/1 16:44, Chen Yu wrote:
> > [Problem Statement]
> > For a workload that is doing frequent context switches, the throughput
> > scales well until the number of instances reaches a peak point. After
> > that peak point, the throughput drops significantly if the number of
> > instances continues to increase.
> >
> > The will-it-scale context_switch1 test case exposes the issue. The
> > test platform has 112 CPUs per LLC domain. The will-it-scale launches
> > 1, 8, 16 ... 112 instances respectively. Each instance is composed
> > of 2 tasks, and each pair of tasks would do ping-pong scheduling via
> > pipe_read() and pipe_write(). No task is bound to any CPU.
> > It is found that, once the number of instances is higher than
> > 56(112 tasks in total, every CPU has 1 task), the throughput
> > drops accordingly if the instance number continues to increase:
> >
> > ^
> > throughput|
> > | X
> > | X X X
> > | X X X
> > | X X
> > | X X
> > | X
> > | X
> > | X
> > | X
> > |
> > +-----------------.------------------->
> > 56
> > number of instances
> >
> > [Symptom analysis]
> >
> > The performance downgrading was caused by a high system idle
> > percentage(around 20% ~ 30%). The CPUs waste a lot of time in
> > idle and do nothing. As a comparison, if set CPU affinity to
> > these workloads and stops them from migrating among CPUs,
> > the idle percentage drops to nearly 0%, and the throughput
> > increases by about 300%. This indicates room for optimization.
> >
> > The reason for the high idle percentage is different before/after
> > commit f3dd3f674555 ("sched: Remove the limitation of WF_ON_CPU
> > on wakelist if wakee cpu is idle").
> >
> > [Before the commit]
> > The bottleneck is the runqueue spinlock.
> >
> > nr_instance rq lock percentage
> > 1 1.22%
> > 8 1.17%
> > 16 1.20%
> > 24 1.22%
> > 32 1.46%
> > 40 1.61%
> > 48 1.63%
> > 56 1.65%
> > --------------------------
> > 64 3.77% |
> > 72 5.90% | increase
> > 80 7.95% |
> > 88 9.98% v
> > 96 11.81%
> > 104 13.54%
> > 112 15.13%
> >
> > And top 2 rq lock hot paths:
> >
> > (path1):
> > raw_spin_rq_lock_nested.constprop.0;
> > try_to_wake_up;
> > default_wake_function;
> > autoremove_wake_function;
> > __wake_up_common;
> > __wake_up_common_lock;
> > __wake_up_sync_key;
> > pipe_write;
> > new_sync_write;
> > vfs_write;
> > ksys_write;
> > __x64_sys_write;
> > do_syscall_64;
> > entry_SYSCALL_64_after_hwframe;write
> >
> > (path2):
> > raw_spin_rq_lock_nested.constprop.0;
> > __sched_text_start;
> > schedule_idle;
> > do_idle;
> > cpu_startup_entry;
> > start_secondary;
> > secondary_startup_64_no_verify
> >
> > task A tries to wake up task B on CPU1, then task A grabs the
> > runqueue lock of CPU1. If CPU1 is about to quit idle, it needs
> > to grab its lock which has been taken by someone else. Then
> > CPU1 takes more time to quit which hurts the performance.
> >
> > [After the commit]
> > The cause is the race condition between select_task_rq() and
> > the task enqueue.
> >
> > Suppose there are nr_cpus pairs of ping-pong scheduling
> > tasks. For example, p0' and p0 are ping-pong scheduling,
> > so do p1' <=> p1, and p2'<=> p2. None of these tasks are
> > bound to any CPUs. The problem can be summarized as:
> > more than 1 wakers are stacked on 1 CPU, which slows down
> > waking up their wakees:
> >
> > CPU0 CPU1 CPU2
> >
> > p0' p1' => idle p2'
> >
> > try_to_wake_up(p0) try_to_wake_up(p2);
> > CPU1 = select_task_rq(p0); CPU1 = select_task_rq(p2);
> > ttwu_queue(p0, CPU1); ttwu_queue(p2, CPU1);
> > __ttwu_queue_wakelist(p0, CPU1); => ttwu_list->p0
> > quiting cpuidle_idle_call()
> >
> > ttwu_list->p2->p0 <= __ttwu_queue_wakelist(p2, CPU1);
> >
> > WRITE_ONCE(CPU1->ttwu_pending, 1); WRITE_ONCE(CPU1->ttwu_pending, 1);
> >
> > p0' => idle
> > sched_ttwu_pending()
> > enqueue_task(p2 and p0)
> >
> > idle => p2
> >
> > ...
> > p2 time slice expires
> > ...
> > !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > <=== !!! p2 delays the wake up of p0' !!!
> > !!! causes long idle on CPU0 !!!
> > p2 => p0 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > p0 wakes up p0'
> >
> > idle => p0'
> >
> >
> >
> > Since there are many waker/wakee pairs in the system, the chain reaction
> > causes many CPUs to be victims. These idle CPUs wait for their waker to
> > be scheduled.
> >
> > Actually Tiancheng has mentioned above issue here[2].
> >
>
> First I want to say that this issue (race condition between selecting idle
> cpu and enqueuing task) always exists before or after the commit
> f3dd3f674555. And I know this is a big issue so in [2] I only try to fix a
> small part of it. Of course I'm glad to see you trying solving this issue
> too :-)
>
> There may be a bit wrong in your comment about the order.
> "WRITE_ONCE(CPU1->ttwu_pending, 1);" on CPU0 is earlier than CPU1 getting
> "ttwu_list->p0", right?
>
Yes, you are right, I'll refine the comment.

thanks,
Chenyu