Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide

From: Brendan Jackman
Date: Fri Sep 29 2017 - 13:05:49 EST



There has been a bit of discussion on this RFC, but before I do any
more work I'd really like your input on the basic idea.

Does the approach of feeding outside info like wake_q length into the
scheduler's decisions seem basically acceptable?

Cheers,
Brendan


On Fri, Aug 11 2017 at 09:45, Brendan Jackman wrote:
> This patch adds a parameter to select_task_rq, sibling_count_hint
> allowing the caller, where it has this information, to inform the
> sched_class the number of tasks that are being woken up as part of
> the same event.
>
> The wake_q mechanism is one case where this information is available.
>
> select_task_rq_fair can then use the information to detect that it
> needs to widen the search space for task placement in order to avoid
> overloading the last-level cache domain's CPUs.
>
> * * *
>
> The reason I am investigating this change is the following use case
> on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
> all repeatedly do X amount of work then
> pthread_barrier_wait (i.e. sleep until the last task finishes its X
> and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
> finish faster, and then those CPUs pull over the tasks that are still
> running:
>
> v CPU v ->time->
>
> -------------
> 0 (big) 11111 /333
> -------------
> 1 (big) 22222 /444|
> -------------
> 2 (LITTLE) 333333/
> -------------
> 3 (LITTLE) 444444/
> -------------
>
> Now when task 4 hits the barrier (at |) and wakes the others up,
> there are 4 tasks with prev_cpu=<big> and 0 tasks with
> prev_cpu=<little>. want_affine therefore means that we'll only look
> in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
> on the bigs until the next load balance, something like this:
>
> v CPU v ->time->
>
> ------------------------
> 0 (big) 11111 /333 31313\33333
> ------------------------
> 1 (big) 22222 /444|424\4444444
> ------------------------
> 2 (LITTLE) 333333/ \222222
> ------------------------
> 3 (LITTLE) 444444/ \1111
> ------------------------
> ^^^
> underutilization
>
> So, I'm trying to get want_affine = 0 for these tasks.
>
> I don't _think_ any incarnation of the wakee_flips mechanism can help
> us here because which task is waker and which tasks are wakees
> generally changes with each iteration.
>
> However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
> nice property that we know exactly how many tasks are being woken, so
> we can cheat.
>
> It might be a disadvantage that we "widen" _every_ task that's woken in
> an event, while select_idle_sibling would work fine for the first
> sd_llc_size - 1 tasks.
>
> IIUC, if wake_affine() behaves correctly this trick wouldn't be
> necessary on SMP systems, so it might be best guarded by the presence
> of SD_ASYM_CPUCAPACITY?
>
> * * *
>
> Final note..
>
> In order to observe "perfect" behaviour for this use case, I also had
> to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
> above we are working through the work queue and have placed tasks 3
> and 2, and are about to place task 1:
>
> v CPU v ->time->
>
> --------------
> 0 (big) 11111 /333 3
> --------------
> 1 (big) 22222 /444|4
> --------------
> 2 (LITTLE) 333333/ 2
> --------------
> 3 (LITTLE) 444444/ <- Task 1 should go here
> --------------
>
> If TTWU_QUEUE is enabled, we will not yet have enqueued task
> 2 (having instead sent a reschedule IPI) or attached its load to CPU
> 2. So we are likely to also place task 1 on cpu 2. Disabling
> TTWU_QUEUE means that we enqueue task 2 before placing task 1,
> solving this issue. TTWU_QUEUE is there to minimise rq lock
> contention, and I guess that this contention is less of an issue on
> big.LITTLE systems since they have relatively few CPUs, which
> suggests the trade-off makes sense here.
>
> Signed-off-by: Brendan Jackman <brendan.jackman@xxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Josef Bacik <josef@xxxxxxxxxxxxxx>
> Cc: Joel Fernandes <joelaf@xxxxxxxxxx>
> Cc: Mike Galbraith <efault@xxxxxx>
> Cc: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>
> ---
> include/linux/sched/wake_q.h | 2 ++
> kernel/sched/core.c | 34 +++++++++++++++++++++++-----------
> kernel/sched/deadline.c | 3 ++-
> kernel/sched/fair.c | 17 +++++++++++------
> kernel/sched/idle_task.c | 3 ++-
> kernel/sched/rt.c | 3 ++-
> kernel/sched/sched.h | 3 ++-
> kernel/sched/stop_task.c | 3 ++-
> 8 files changed, 46 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
> index d03d8a9047dc..607a888eb35b 100644
> --- a/include/linux/sched/wake_q.h
> +++ b/include/linux/sched/wake_q.h
> @@ -33,6 +33,7 @@
> struct wake_q_head {
> struct wake_q_node *first;
> struct wake_q_node **lastp;
> + int count;
> };
>
> #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
> @@ -44,6 +45,7 @@ static inline void wake_q_init(struct wake_q_head *head)
> {
> head->first = WAKE_Q_TAIL;
> head->lastp = &head->first;
> + head->count = 0;
> }
>
> extern void wake_q_add(struct wake_q_head *head,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0869b20fba81..ddf9257b1467 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -438,6 +438,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
> if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
> return;
>
> + head->count++;
> +
> get_task_struct(task);
>
> /*
> @@ -447,6 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
> head->lastp = &node->next;
> }
>
> +static int
> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
> + int sibling_count_hint);
> +
> void wake_up_q(struct wake_q_head *head)
> {
> struct wake_q_node *node = head->first;
> @@ -461,10 +467,10 @@ void wake_up_q(struct wake_q_head *head)
> task->wake_q.next = NULL;
>
> /*
> - * wake_up_process() implies a wmb() to pair with the queueing
> + * try_to_wake_up() implies a wmb() to pair with the queueing
> * in wake_q_add() so as not to miss wakeups.
> */
> - wake_up_process(task);
> + try_to_wake_up(task, TASK_NORMAL, 0, head->count);
> put_task_struct(task);
> }
> }
> @@ -1527,12 +1533,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
> * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
> */
> static inline
> -int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
> +int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
> + int sibling_count_hint)
> {
> lockdep_assert_held(&p->pi_lock);
>
> if (p->nr_cpus_allowed > 1)
> - cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
> + cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
> + sibling_count_hint);
> else
> cpu = cpumask_any(&p->cpus_allowed);
>
> @@ -1944,6 +1952,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> * @p: the thread to be awakened
> * @state: the mask of task states that can be woken
> * @wake_flags: wake modifier flags (WF_*)
> + * @sibling_count_hint: A hint at the number of threads that are being woken up
> + * in this event.
> *
> * If (@state & @p->state) @p->state = TASK_RUNNING.
> *
> @@ -1956,7 +1966,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> * %false otherwise.
> */
> static int
> -try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
> + int sibling_count_hint)
> {
> unsigned long flags;
> int cpu, success = 0;
> @@ -2042,7 +2053,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> atomic_dec(&task_rq(p)->nr_iowait);
> }
>
> - cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> + cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags,
> + sibling_count_hint);
> if (task_cpu(p) != cpu) {
> wake_flags |= WF_MIGRATED;
> set_task_cpu(p, cpu);
> @@ -2130,13 +2142,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf)
> */
> int wake_up_process(struct task_struct *p)
> {
> - return try_to_wake_up(p, TASK_NORMAL, 0);
> + return try_to_wake_up(p, TASK_NORMAL, 0, 1);
> }
> EXPORT_SYMBOL(wake_up_process);
>
> int wake_up_state(struct task_struct *p, unsigned int state)
> {
> - return try_to_wake_up(p, state, 0);
> + return try_to_wake_up(p, state, 0, 1);
> }
>
> /*
> @@ -2442,7 +2454,7 @@ void wake_up_new_task(struct task_struct *p)
> * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
> * as we're not fully set-up yet.
> */
> - __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
> + __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0, 1));
> #endif
> rq = __task_rq_lock(p, &rf);
> update_rq_clock(rq);
> @@ -2893,7 +2905,7 @@ void sched_exec(void)
> int dest_cpu;
>
> raw_spin_lock_irqsave(&p->pi_lock, flags);
> - dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
> + dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0, 1);
> if (dest_cpu == smp_processor_id())
> goto unlock;
>
> @@ -3582,7 +3594,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
> int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
> void *key)
> {
> - return try_to_wake_up(curr->private, mode, wake_flags);
> + return try_to_wake_up(curr->private, mode, wake_flags, 1);
> }
> EXPORT_SYMBOL(default_wake_function);
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 755bd3f1a1a9..69a9dd407267 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1516,7 +1516,8 @@ static void yield_task_dl(struct rq *rq)
> static int find_later_rq(struct task_struct *task);
>
> static int
> -select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags,
> + int sibling_count_hint)
> {
> struct task_struct *curr;
> struct rq *rq;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c95880e216f6..0a9d706b62bf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5332,15 +5332,18 @@ static void record_wakee(struct task_struct *p)
> * whatever is irrelevant, spread criteria is apparent partner count exceeds
> * socket size.
> */
> -static int wake_wide(struct task_struct *p)
> +static int wake_wide(struct task_struct *p, int sibling_count_hint)
> {
> unsigned int master = current->wakee_flips;
> unsigned int slave = p->wakee_flips;
> - int factor = this_cpu_read(sd_llc_size);
> + int llc_size = this_cpu_read(sd_llc_size);
> +
> + if (sibling_count_hint >= llc_size)
> + return 1;
>
> if (master < slave)
> swap(master, slave);
> - if (slave < factor || master < slave * factor)
> + if (slave < llc_size || master < slave * llc_size)
> return 0;
> return 1;
> }
> @@ -5869,7 +5872,8 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> * preempt must be disabled.
> */
> static int
> -select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
> +select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags,
> + int sibling_count_hint)
> {
> struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> int cpu = smp_processor_id();
> @@ -5879,8 +5883,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>
> if (sd_flag & SD_BALANCE_WAKE) {
> record_wakee(p);
> - want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> - && cpumask_test_cpu(cpu, &p->cpus_allowed);
> + want_affine = !wake_wide(p, sibling_count_hint) &&
> + !wake_cap(p, cpu, prev_cpu) &&
> + cpumask_test_cpu(cpu, &p->cpus_allowed);
> }
>
> rcu_read_lock();
> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> index 0c00172db63e..3c343e055110 100644
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -9,7 +9,8 @@
>
> #ifdef CONFIG_SMP
> static int
> -select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags,
> + int sibling_count_hint)
> {
> return task_cpu(p); /* IDLE tasks as never migrated */
> }
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 45caf937ef90..b9937dccb8b3 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1387,7 +1387,8 @@ static void yield_task_rt(struct rq *rq)
> static int find_lowest_rq(struct task_struct *task);
>
> static int
> -select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
> + int sibling_count_hint)
> {
> struct task_struct *curr;
> struct rq *rq;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index eeef1a3086d1..56ae525618e9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1419,7 +1419,8 @@ struct sched_class {
> void (*put_prev_task) (struct rq *rq, struct task_struct *p);
>
> #ifdef CONFIG_SMP
> - int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
> + int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
> + int subling_count_hint);
> void (*migrate_task_rq)(struct task_struct *p);
>
> void (*task_woken) (struct rq *this_rq, struct task_struct *task);
> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> index 9f69fb630853..d0ce4fbb18ef 100644
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -11,7 +11,8 @@
>
> #ifdef CONFIG_SMP
> static int
> -select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags,
> + int sibling_count_hint)
> {
> return task_cpu(p); /* stop tasks as never migrate */
> }