[PATCH v4 0/8] sched: Implement shared runqueue in fair.c

From: David Vernet
Date: Mon Dec 11 2023 - 19:31:54 EST


This is v4 of the shared runqueue patchset. This patch set is based off
of commit 418146e39891 ("freezer,sched: Clean saved_state when restoring
it during thaw") on the sched/core branch of tip.git.

In prior versions of this patch set, I was observing consistent and
statistically significant wins for several benchmarks when this feature
was enabled, such as kernel compile and hackbench. After rebasing onto
the latest sched/core on tip.git, I'm no longer observing these wins,
and in fact observe some performance loss with SHARED_RUNQ on hackbench.
I ended up bisecting this to when EEVDF was merged.

As I mentioned in [0], our plan for now is to take a step back and
re-evaluate how we want to proceed with this patch set. That said, I did
want to send this out in the interim in case it could be of interest to
anyone else who would like to continue to experiment with it.

[0]: https://lore.kernel.org/all/20231204193001.GA53255@maniforge/

v1 (RFC): https://lore.kernel.org/lkml/20230613052004.2836135-1-void@xxxxxxxxxxxxx/
v2: https://lore.kernel.org/lkml/20230710200342.358255-1-void@xxxxxxxxxxxxx/
v3: https://lore.kernel.org/all/20230809221218.163894-1-void@xxxxxxxxxxxxx/

v3 -> v4 changes:
- Ensure the shared_runq list is fully drained when toggling the feature
on and off (noticed offline by Chris Mason, and independently by
Aboorva Devarajan)

- Also check !is_migration_disabled() in shared_runq_pick_next_task() if
we find a task in the shard, as is_cpu_allowed() doesn't check for
migration. Additionally, re-check is_cpu_allowed() after re-acquiring
the task rq lock, in case the task's state changed in the meantime (see
the sketch following this changelog)

- Statically initialize the shared_runq_node list node in the init task.

- Only try to pull a task from the shared_runq if the rq's root domain has
the overload bit set (K Prateek Nayak)

- Check this_rq->ttwu_pending after trying to pull a task from a shared
runqueue shard before going forward to load_balance() (K Prateek Nayak)

- Fix the logic for skipping the lowest-level sched domain -- do the
check in load_balance() itself rather than before it, as
for_each_domain() always iterates from the first domain (K Prateek
Nayak)

- Add a selftest testcase which toggles the SHARED_RUNQ feature on and
off in a loop
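
To make the migration checks above concrete, below is a rough sketch of
the ordering in shared_runq_pick_next_task(). This is an illustration
rather than the patch itself: shared_runq_pop_task() is a hypothetical
helper standing in for the locked removal of the shard's HEAD task, and
the locking around the eventual move_queued_task() call is simplified.

	static struct task_struct *
	shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
	{
		struct task_struct *p;
		struct rq *src_rq;
		struct rq_flags src_rf;

		/* Hypothetical helper: pop the HEAD task of this CPU's
		 * shard under the shard's raw spinlock. */
		p = shared_runq_pop_task(rq);
		if (!p)
			return NULL;

		/*
		 * is_cpu_allowed() does not check whether migration is
		 * disabled, so check that explicitly first.
		 */
		if (is_migration_disabled(p) || !is_cpu_allowed(p, cpu_of(rq)))
			return NULL;

		src_rq = task_rq_lock(p, &src_rf);

		/* Re-check now that the task's rq lock is held, in case
		 * its state changed in the meantime. */
		if (task_on_rq_queued(p) && is_cpu_allowed(p, cpu_of(rq))) {
			/* Skip the move if the task is already here. */
			if (src_rq != rq)
				src_rq = move_queued_task(src_rq, &src_rf,
							  p, cpu_of(rq));
			task_rq_unlock(src_rq, p, &src_rf);
			return p;
		}

		task_rq_unlock(src_rq, p, &src_rf);
		return NULL;
	}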

Worth noting is that there have been a number of suggestions for
improving this feature that were not included in this v4, such as:

- [1], where Chen Yu <yu.c.chen@xxxxxxxxx> suggested not putting certain
tasks on a shared_runq if e.g. p->avg_runtime <= sysctl_migration_cost.
I elected not to include this, as it's a heuristic that could
incorrectly prevent work conservation, which is the primary goal of
the feature.

- [2], where K Prateek Nayak <kprateek.nayak@xxxxxxx> suggested adding a
per-shard "overload" flag that can be set to avoid contending on the
shard lock. This should be covered by checking the root domain
overload flag.

- [3], where K Prateek Nayak <kprateek.nayak@xxxxxxx> suggested also
checking rq->avg_idle < sd->max_newidle_lb_cost. This is a similar
suggestion to Chen Yu's above, and I elected to leave it out here for
the same reason: that we want to encourage work conservation.

- [4], where Gautham Shenoy <gautham.shenoy@xxxxxxx> suggested iterating
over all tasks in a shard until one is found that can be pulled, rather
than bailing out after failing to migrate the HEAD task (sketched after
the links below).

None of these ideas is unreasonable, and they may be worth applying if
further testing shows that they improve the feature for more general
cases. I left the patch set as is simply to keep the feature
"consistent" in encouraging work conservation, but that decision can be
revisited.

[1]: https://lore.kernel.org/all/ZO7e5YaS71cXVxQN@chenyu5-mobl2/
[2]: https://lore.kernel.org/all/20230831104508.7619-4-kprateek.nayak@xxxxxxx/
[3]: https://lore.kernel.org/all/20230831104508.7619-3-kprateek.nayak@xxxxxxx/
[4]: https://lore.kernel.org/lkml/ZJkqeXkPJMTl49GB@xxxxxxxxxxxxxxxxxxxxxx/
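
To illustrate [4], a hypothetical version of the shard scan could look
like the snippet below, walking the whole list under the shard lock
rather than considering only the HEAD task (the shard layout and the
target_cpu variable are illustrative, and any candidate would still
need to be re-validated under its rq lock as described in the v4
changelog above):

	struct task_struct *p, *found = NULL;

	raw_spin_lock(&shard->lock);
	list_for_each_entry(p, &shard->list, shared_runq_node) {
		/* Skip tasks that can't run on the target CPU rather
		 * than bailing out after the first failure. */
		if (is_cpu_allowed(p, target_cpu) &&
		    !is_migration_disabled(p)) {
			list_del_init(&p->shared_runq_node);
			found = p;
			break;
		}
	}
	raw_spin_unlock(&shard->lock);

The obvious cost is a longer shard lock hold time under contention,
which would need to be weighed against the work conservation win.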

v2 -> v3 changes:
- Don't leave stale tasks in the lists when the SHARED_RUNQ feature is
disabled (Abel Wu)

- Use raw spin lock instead of spinlock_t (Peter)

- Fix return value from shared_runq_pick_next_task() to match the
semantics expected by newidle_balance() (Gautham, Abel)

- Fold the __enqueue_entity() / __dequeue_entity() patch into the
previous patch (Peter)

- Skip <= LLC domains in newidle_balance() if SHARED_RUNQ is enabled
(Peter)

- Properly support hotplug and recreating sched domains (Peter)

- Avoid unnecessary task_rq_unlock() + raw_spin_rq_lock() when src_rq ==
target_rq in shared_runq_pick_next_task() (Abel)

- Only issue list_del_init() in shared_runq_dequeue_task() if the task
is still in the list after acquiring the lock (Aaron Lu)

- Slightly change shared_runq_shard_idx() to make it more likely to keep
SMT siblings in the same bucket (Peter; sketched below)
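
For the last item, the intuition is that adjacent CPU IDs are often
(though not always) SMT siblings, so dividing the CPU ID by two before
taking the modulus tends to map siblings to the same shard. A minimal
sketch, assuming a num_shards field on the per-LLC shared_runq:

	/* Map a CPU to a shard index, keeping CPUs 2n and 2n + 1
	 * (commonly SMT siblings) in the same bucket. */
	static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
	{
		return (cpu >> 1) % runq->num_shards;
	}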

v1 -> v2 changes:
- Change name from swqueue to shared_runq (Peter)

- Shard per-LLC shared runqueues to avoid contention on scheduler-heavy
workloads (Peter; see the data structure sketch after this list)

- Pull tasks from the shared_runq in newidle_balance() rather than in
pick_next_task_fair() (Peter and Vincent)

- Rename a few functions to reflect their actual purpose. For example,
shared_runq_dequeue_task() instead of swqueue_remove_task() (Peter)

- Expose move_queued_task() from core.c rather than migrate_task_to()
(Peter)

- Properly check is_cpu_allowed() when pulling a task from a shared_runq
to ensure it can actually be migrated (Peter and Gautham)

- Dropped RFC tag
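
Taken together, these changes mean the core data structure is
conceptually just a per-LLC array of raw-spinlock-protected task lists.
A rough sketch (field names are illustrative):

	/* A single shard: a FIFO of tasks plus the lock protecting it. */
	struct shared_runq_shard {
		struct list_head list;
		raw_spinlock_t lock;
	} ____cacheline_aligned;

	/* Per-LLC shared runqueue, carved into shards to reduce lock
	 * contention on scheduler-heavy workloads. */
	struct shared_runq {
		int num_shards;
		struct shared_runq_shard shards[];
	} ____cacheline_aligned;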


David Vernet (8):
sched: Expose move_queued_task() from core.c
sched: Move is_cpu_allowed() into sched.h
sched: Tighten unpinned rq lock window in newidle_balance()
sched: Check cpu_active() earlier in newidle_balance()
sched: Enable sched_feat callbacks on enable/disable
sched: Implement shared runqueue in CFS
sched: Shard per-LLC shared runqueues
sched: Add selftest for SHARED_RUNQ

include/linux/sched.h | 2 +
init/init_task.c | 1 +
kernel/sched/core.c | 52 +--
kernel/sched/debug.c | 18 +-
kernel/sched/fair.c | 413 +++++++++++++++++++++-
kernel/sched/features.h | 2 +
kernel/sched/sched.h | 60 +++-
kernel/sched/topology.c | 4 +-
tools/testing/selftests/sched/Makefile | 7 +-
tools/testing/selftests/sched/config | 2 +
tools/testing/selftests/sched/test-swq.sh | 23 ++
11 files changed, 521 insertions(+), 63 deletions(-)
create mode 100755 tools/testing/selftests/sched/test-swq.sh

--
2.42.1