Re: [tip:sched/eevdf] [sched/fair] e0c2ff903c: phoronix-test-suite.blogbench.Write.final_score -34.8% regression

From: Chen Yu
Date: Tue Aug 22 2023 - 02:49:19 EST


Hi Peter, Mike,

On 2023-08-18 at 09:09:29 +0800, Chen Yu wrote:
> On 2023-08-16 at 15:40:59 +0200, Peter Zijlstra wrote:
> > On Wed, Aug 16, 2023 at 02:37:16PM +0200, Peter Zijlstra wrote:
> > > On Mon, Aug 14, 2023 at 08:32:55PM +0200, Mike Galbraith wrote:
> > >
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -875,6 +875,12 @@ static struct sched_entity *pick_eevdf(s
> > > >  	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> > > >  		curr = NULL;
> > > >
> > > > +	/*
> > > > +	 * Once selected, run the task to parity to avoid overscheduling.
> > > > +	 */
> > > > +	if (sched_feat(RUN_TO_PARITY) && curr)
> > > > +		return curr;
> > > > +
> > > >  	while (node) {
> > > >  		struct sched_entity *se = __node_2_se(node);
> > > >
> > >
> > > So I read it wrong last night... but I rather like this idea. But
> > > there's something missing. When curr starts a new slice it should
> > > probably do a full repick and not stick with it.
> > >
> > > Let me poke at this a bit.. nice
> >
> > Something like so.. it shouldn't matter much now, but might make a
> > difference once we start mixing different slice lengths.
> >
> > ---
> > kernel/sched/fair.c | 12 ++++++++++++
> > kernel/sched/features.h | 1 +
> > 2 files changed, 13 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index fe5be91c71c7..128a78f3f264 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -873,6 +873,13 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> >  	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> >  		curr = NULL;
> >
> > +	/*
> > +	 * Once selected, run a task until it either becomes non-eligible or
> > +	 * until it gets a new slice. See the HACK in set_next_entity().
> > +	 */
> > +	if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
> > +		return curr;
> > +
> >  	while (node) {
> >  		struct sched_entity *se = __node_2_se(node);
> >
> > @@ -5168,6 +5175,11 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  		update_stats_wait_end_fair(cfs_rq, se);
> >  		__dequeue_entity(cfs_rq, se);
> >  		update_load_avg(cfs_rq, se, UPDATE_TG);
> > +		/*
> > +		 * HACK, stash a copy of deadline at the point of pick in vlag,
> > +		 * which isn't used until dequeue.
> > +		 */
> > +		se->vlag = se->deadline;
> >  	}
> >
> >  	update_stats_curr_start(cfs_rq, se);
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index 61bcbf5e46a4..f770168230ae 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -6,6 +6,7 @@
> >   */
> >  SCHED_FEAT(PLACE_LAG, true)
> >  SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
> > +SCHED_FEAT(RUN_TO_PARITY, true)
> >
> >  /*
> >   * Prefer to schedule the task we woke last (assuming it failed
>
> I had a test on top of this, and it restored some performance:
>
> uname -r
> 6.5.0-rc2-mixed-slice-run-parity
>
> echo NO_RUN_TO_PARITY > /sys/kernel/debug/sched/features
> hackbench -g 112 -f 20 --threads -l 30000 -s 100
> Running in threaded mode with 112 groups using 40 file descriptors each (== 4480 tasks)
> Each sender will pass 30000 messages of 100 bytes
> Time: 118.220 <---
>
>
> echo RUN_TO_PARITY > /sys/kernel/debug/sched/features
> hackbench -g 112 -f 20 --threads -l 30000 -s 100
> Running in threaded mode with 112 groups using 40 file descriptors each (== 4480 tasks)
> Each sender will pass 30000 messages of 100 bytes
> Time: 100.873 <---
>
> Then I switched to the first commit of the EEVDF series, the one that
> introduces avg_vruntime for cfs_rq (but does not use it yet).
> It seems that there is still 50% of the performance to be restored:
>
> uname -r
> 6.5.0-rc2-add-cfs-avruntime
>
> hackbench -g 112 -f 20 --threads -l 30000 -s 100
> Running in threaded mode with 112 groups using 40 file descriptors each (== 4480 tasks)
> Each sender will pass 30000 messages of 100 bytes
> Time: 52.569 <---
>
> I'll collect statistics to check what is remaining there.
>
> [Besides, with the $subject benchmark, I tested it with PLACE_DEADLINE_INITIAL
> disabled, and it did not show much difference. I'll work with lkp
> to test it with RUN_TO_PARITY enabled.]
>
>

To figure out the remaining hackbench performance gap, I pulled
tip/sched/core and used FlameGraph to inspect the perf profile.
According to the flamegraphs, the bottlenecks are as follows:

Baseline:
"sched/fair: Add cfs_rq::avg_vruntime"
unix_stream_sendmsg() -> slab_alloc_node() -> memcg_slab_pre_alloc_hook()
-> obj_cgroup_charge_pages() -> atomic_add()
The atomic write to the per memory cgroup counter is the bottleneck[1]

Compare:
"sched/eevdf: Curb wakeup-preemption"
unix_stream_sendmsg() -> slab_alloc_node() -> ___slab_alloc()
-> get_partial() -> get_partial_node()
-> spinlock_irq()
The spinlock on the per-node kmem_cache_node's list_lock is the bottleneck[2]

In theory the negative impact of the atomic write should be smaller than
that of the spinlock, because the latter increases the busy-wait time a
lot. This might be the reason why we see lower throughput with EEVDF.
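
As a rough userspace analogy (this is not the kernel code path; the
thread count, iteration count and critical-section length below are
arbitrary), the difference between the two patterns can be sketched
like this: with a relaxed atomic add every thread keeps making
progress, while with a spinlock around a longer critical section every
waiter busy-waits for the full hold time.

/*
 * Userspace toy only, NOT the kernel code.
 * Build: gcc -O2 -pthread contention-toy.c -o contention-toy
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS	8
#define ITERS		(1 << 20)
#define HOLD_WORK	16	/* extra work done while holding the lock */

static atomic_long acounter;		/* "memcg counter" style pattern  */
static long counter;			/* protected by the toy list_lock */
static pthread_spinlock_t list_lock;

static void *atomic_worker(void *arg)
{
	(void)arg;
	/* every thread makes progress; the cacheline bounces around but
	 * nobody spins waiting for somebody else's critical section */
	for (long i = 0; i < ITERS; i++)
		atomic_fetch_add_explicit(&acounter, 1, memory_order_relaxed);
	return NULL;
}

static void *spinlock_worker(void *arg)
{
	(void)arg;
	for (long i = 0; i < ITERS; i++) {
		pthread_spin_lock(&list_lock);
		/* a longer critical section: all the other threads busy-wait
		 * for its whole duration, so the wait time adds up quickly */
		for (int j = 0; j < HOLD_WORK; j++)
			counter++;
		pthread_spin_unlock(&list_lock);
	}
	return NULL;
}

static double run(void *(*fn)(void *))
{
	pthread_t tid[NTHREADS];
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, fn, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
	pthread_spin_init(&list_lock, PTHREAD_PROCESS_PRIVATE);
	printf("atomic add: %.3f s\n", run(atomic_worker));
	printf("spinlock  : %.3f s\n", run(spinlock_worker));
	return 0;
}

The real slow path in get_partial_node() is of course heavier than this
toy: it walks the node partial list with list_lock held and, judging
from the _irq variant in the profile above, with interrupts disabled.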

Since EEVDF does not change any code in mm, I have no idea why the
bottleneck shifts from the atomic write to the spinlock. One guess is
timing: before EEVDF, tasks entered the spinlock critical section at
different times, while after EEVDF, tasks enter the critical section at
the same time.

Reducing the spinlock contention should help the situation. Feng Tang
told me that there is a discussion on LKML related to high slub
spinlock contention[2], and Feng provided a hack patch which enlarges
the slab percpu partial list order so as to reduce the lock contention
on the node (if I understand the patch correctly).
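
My understanding of why that helps, sketched as a userspace analogy
(this is not mm/slub.c; the BATCH size and the pool bookkeeping below
are made up): each CPU keeps a private batch of objects and only takes
the shared node lock when that batch needs a refill, so the larger the
per-cpu batch, the fewer list_lock acquisitions per allocation.

/*
 * Userspace analogy only, NOT mm/slub.c.  A per-thread "partial list"
 * of size BATCH is refilled from a shared, lock-protected pool, so the
 * shared lock is taken once per BATCH allocations instead of once per
 * allocation; enlarging BATCH cuts the traffic on the shared lock.
 * Build: gcc -O2 -pthread batch-toy.c -o batch-toy
 */
#include <pthread.h>
#include <stdio.h>

#define BATCH		32		/* "per-cpu partial list" size (made up) */
#define NALLOCS		(1 << 20)

static pthread_spinlock_t node_lock;	/* the shared "list_lock" analogue */
static long shared_pool = NALLOCS;	/* fake objects in the shared pool */
static long lock_acquisitions;

struct cpu_cache {
	long objs;			/* objects left in the private batch */
};

static void alloc_one(struct cpu_cache *c)
{
	if (c->objs == 0) {
		/* slow path: refill the whole batch under the shared lock */
		pthread_spin_lock(&node_lock);
		shared_pool -= BATCH;
		lock_acquisitions++;
		pthread_spin_unlock(&node_lock);
		c->objs = BATCH;
	}
	/* fast path: no shared lock needed */
	c->objs--;
}

int main(void)
{
	struct cpu_cache cache = { 0 };

	pthread_spin_init(&node_lock, PTHREAD_PROCESS_PRIVATE);
	for (long i = 0; i < NALLOCS; i++)
		alloc_one(&cache);

	/* roughly NALLOCS / BATCH: doubling BATCH halves the lock traffic */
	printf("%ld allocations, %ld lock acquisitions\n",
	       (long)NALLOCS, lock_acquisitions);
	return 0;
}

The trade-off is that a larger per-cpu batch keeps more memory cached
per CPU.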

With this patch applied, the spinlock contention is reduced a lot[3],
and the performance improves to:

hackbench -g 112 -f 20 --threads -l 30000 -s 100
Running in threaded mode with 112 groups using 40 file descriptors each (== 4480 tasks)
Each sender will pass 30000 messages of 100 bytes
Time: 71.838

I guess that enlarging the percpu slab partial list further could reduce
the spinlock contention further and bring more improvement. Meanwhile,
I'm still wondering what causes multiple tasks to enter the spinlock
section at the same time.

BTW, with RUN_TO_PARITY enabled, blogbench has also recovered some of
its performance.


[1] https://raw.githubusercontent.com/yu-chen-surf/schedtests/master/results/eevdf/flamegraph_add_avg_runtime.svg
[2] https://raw.githubusercontent.com/yu-chen-surf/schedtests/master/results/eevdf/flamegraph_eevdf_parity.svg
[3] https://raw.githubusercontent.com/yu-chen-surf/schedtests/master/results/eevdf/flamegraph_slab_enlarge_order_eevdf_parity.svg

thanks,
Chenyu