Re: [PATCH 11/15] sched,fair: flatten hierarchical runqueues

From: Dietmar Eggemann
Date: Fri Aug 23 2019 - 14:14:46 EST


On 22/08/2019 04:17, Rik van Riel wrote:
> Flatten the hierarchical runqueues into just the per CPU rq.cfs runqueue.
>
> Iteration of the sched_entity hierarchy is rate limited to once per jiffy
> per sched_entity, which is a smaller change than it seems, because load
> average adjustments were already rate limited to once per jiffy before this
> patch series.
>
> This patch breaks CONFIG_CFS_BANDWIDTH. The plan for that is to park tasks
> from throttled cgroups onto their cgroup runqueues, and slowly (using the
> GENTLE_FAIR_SLEEPERS) wake them back up, in vruntime order, once the cgroup
> gets unthrottled, to prevent thundering herd issues.
>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
>
> Header from folded patch 'fix-attach-detach_enticy_cfs_rq.patch~':
>
> Subject: sched,fair: fix attach/detach_entity_cfs_rq
>
> While attach_entity_cfs_rq and detach_entity_cfs_rq should iterate over
> the hierarchy, they do not need to do that twice.
>
> Passing flags into propagate_entity_cfs_rq allows us to reuse that same
> loop from other functions.
>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
>
>
> Header from folded patch 'enqueue-order.patch':
>
> Subject: sched,fair: better ordering at enqueue_task_fair time
>
> In order to get useful numbers for the task's hierarchical weight,
> task priority, etc., things need to be done in a certain order at task
> enqueue time.
>
> Specifically:
> 1) static load/weight to "local" cfs_rq
> 2) propagate load/weight up the tree
> 3) add runnable load avg to root cfs_rq
>
> The reason is that each step depends on the things done by the
> step beforehand, and we can end up with nonsense numbers if we
> do not do things right.
>
> Also, make sure that we walk all the way up the hierarchy at
> enqueue_task_fair time in order to get the benefit from the ramp-up
> logic in update_cfs_group.

[...]

> /*
> @@ -6953,7 +6849,6 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
> return;
>
> - find_matching_se(&se, &pse);
> update_curr(cfs_rq_of(se));
> BUG_ON(!pse);
> if (wakeup_preempt_entity(se, pse) == 1) {
> @@ -6994,100 +6889,18 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> struct task_struct *p;
> int new_tasks;
>
> + put_prev_task(rq, prev);
> again:
> if (!cfs_rq->nr_running)
> goto idle;
>
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> - if (prev->sched_class != &fair_sched_class)
> - goto simple;
> -
> - /*
> - * Because of the set_next_buddy() in dequeue_task_fair() it is rather
> - * likely that a next task is from the same cgroup as the current.
> - *
> - * Therefore attempt to avoid putting and setting the entire cgroup
> - * hierarchy, only change the part that actually changes.
> - */
> -
> - do {
> - struct sched_entity *curr = cfs_rq->curr;
> -
> - /*
> - * Since we got here without doing put_prev_entity() we also
> - * have to consider cfs_rq->curr. If it is still a runnable
> - * entity, update_curr() will update its vruntime, otherwise
> - * forget we've ever seen it.
> - */
> - if (curr) {
> - if (curr->on_rq)
> - update_curr(cfs_rq);
> - else
> - curr = NULL;
> -
> - /*
> - * This call to check_cfs_rq_runtime() will do the
> - * throttle and dequeue its entity in the parent(s).
> - * Therefore the nr_running test will indeed
> - * be correct.
> - */
> - if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
> - cfs_rq = &rq->cfs;
> -
> - if (!cfs_rq->nr_running)
> - goto idle;
> -
> - goto simple;
> - }
> - }
> -
> - se = pick_next_entity(cfs_rq, curr);
> - cfs_rq = group_cfs_rq(se);
> - } while (cfs_rq);
> -
> - p = task_of(se);
> -
> - /*
> - * Since we haven't yet done put_prev_entity and if the selected task
> - * is a different task than we started out with, try and touch the
> - * least amount of cfs_rqs.
> - */
> - if (prev != p) {
> - struct sched_entity *pse = &prev->se;
> -
> - while (!(cfs_rq = is_same_group(se, pse))) {
> - int se_depth = se->depth;
> - int pse_depth = pse->depth;
> -
> - if (se_depth <= pse_depth) {
> - put_prev_entity(cfs_rq_of(pse), pse);
> - pse = parent_entity(pse);
> - }
> - if (se_depth >= pse_depth) {
> - set_next_entity(cfs_rq_of(se), se);
> - se = parent_entity(se);
> - }

With the se->depth related code gone here in pick_next_task_fair(), and
the call to find_matching_se() removed from check_preempt_wakeup(), it
looks like you could remove se->depth entirely.
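
For reference, a rough userspace sketch of what se->depth is used for in
a hierarchical walk of this kind. This is not the kernel code: struct
toy_se, toy_same_group() and toy_find_matching_se() are hypothetical
stand-ins, and the only assumption is that depth is 0 at the root level
and parent->depth + 1 below it. The depth field only serves to bring both
entities to the same level before walking up in lockstep, so with a flat
runqueue there is nothing left to level.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a scheduling entity (hypothetical, userspace only). */
struct toy_se {
	struct toy_se *parent;	/* NULL for entities on the root runqueue */
	int depth;		/* 0 at the root level, parent->depth + 1 below */
};

/* Two entities are siblings when they share the same parent. */
static int toy_same_group(const struct toy_se *a, const struct toy_se *b)
{
	return a->parent == b->parent;
}

/*
 * Walk both entities up the hierarchy until they are siblings.
 * depth is used only to level the two entities first; after that
 * both pointers move up one step at a time.
 */
static void toy_find_matching_se(struct toy_se **se, struct toy_se **pse)
{
	assert(se && *se && pse && *pse);

	/* Bring the deeper entity up to the depth of the shallower one. */
	while ((*se)->depth > (*pse)->depth)
		*se = (*se)->parent;
	while ((*pse)->depth > (*se)->depth)
		*pse = (*pse)->parent;

	/* Same depth now: climb in lockstep until the parents match. */
	while (!toy_same_group(*se, *pse)) {
		*se = (*se)->parent;
		*pse = (*pse)->parent;
	}
}
```

Once every task sits directly on rq->cfs, any two entities already share
the same (NULL) parent in this model, so all three loops are no-ops.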

[...]