Re: [PATCH v4] sched/numa: add per-process numa_balancing

From: Peter Zijlstra
Date: Thu Sep 29 2022 - 04:45:36 EST



The alternative to this is of course to have your latency critical
applications use mbind()/set_mempolicy() etc.., because surely, them
being timing critical, they have the infrastructure to do this right?

Because timing critical software doesn't want its memory spread
randomly, because, well, random is bad for performance, hmm?

And once numa balancing sees all the memory has an explicit policy, it
won't touch it.
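
( For the record, a rough sketch of that alternative; not from this
  thread, and the helper below is made up for illustration: bind an
  allocation to one node with an explicit policy, which numa balancing
  then leaves alone. )

#include <numaif.h>		/* mbind(), MPOL_BIND; link with -lnuma */
#include <stdlib.h>

static void *alloc_on_node(size_t size, int node)
{
	unsigned long nodemask = 1UL << node;
	void *p = aligned_alloc(4096, size);

	/* Explicit policy for this range; if mbind() fails the memory
	 * simply keeps the default policy. */
	if (p)
		mbind(p, size, MPOL_BIND, &nodemask,
		      sizeof(nodemask) * 8, 0);
	return p;
}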

On Thu, Sep 29, 2022 at 02:43:58PM +0800, Gang Li wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ef0e6b3e08ff..87215b3776c9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2818,6 +2818,24 @@ void task_numa_free(struct task_struct *p, bool final)
> }
> }
>
> +inline bool numa_balancing_enabled(struct task_struct *p)

Does this want to be static?

> +{
> +	if (p->mm) {
> +		int numab = p->mm->numab_enabled;
> +
> +		switch (numab) {
> +		case NUMAB_ENABLED:
> +			return true;
> +		case NUMAB_DISABLED:
> +			return false;
> +		case NUMAB_DEFAULT:
> +			break;
> +		}
> +	}
> +
> +	return static_branch_unlikely(&sched_numa_balancing);
> +}

Blergh, this sucks. Now you have the unconditional pointer chasing and
cache-misses. The advantage of sched_numa_balancing was that there is no
overhead when disabled.

Also, "numab" is a weird word.

What about something like:

static inline bool numa_balancing_enabled(struct task_struct *p)
{
	if (!static_branch_unlikely(&sched_numa_balancing))
		return false;

	if (p->mm) switch (p->mm->numa_balancing_mode) {
	case NUMA_BALANCING_ENABLED:
		return true;
	case NUMA_BALANCING_DISABLED:
		return false;
	default:
		break;
	}

	return sysctl_numa_balancing_mode;
}

( Note how that all follows the existing 'numa_balancing' wording,
without inventing weird new words. )

And then you frob the sysctl and prctl such that sched_numa_balancing
and sysctl_numa_balancing_mode are not tied together just so.
Specifically, I'm thinking you should use static_branch_inc() to count
how many enables you have, one for the default and one for each prctl().
Then it all just works.
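
Roughly, as a sketch of the counting idea only (the helper below is
made up for illustration, not part of the patch):

/*
 * Keep a reference count in the static key: it stays true as long as
 * the system-wide default or at least one mm wants balancing, so the
 * fast path keeps its zero overhead when nobody does.
 */
static void numa_balancing_set_mm_mode(struct mm_struct *mm, int mode)
{
	bool was = (mm->numa_balancing_mode == NUMA_BALANCING_ENABLED);
	bool now = (mode == NUMA_BALANCING_ENABLED);

	if (now && !was)
		static_branch_inc(&sched_numa_balancing);
	else if (!now && was)
		static_branch_dec(&sched_numa_balancing);

	mm->numa_balancing_mode = mode;
}

plus one more inc/dec pair for when the sysctl default flips.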

> @@ -11581,8 +11599,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> 		entity_tick(cfs_rq, se, queued);
> 	}
>
> -	if (static_branch_unlikely(&sched_numa_balancing))
> +#ifdef CONFIG_NUMA_BALANCING
> +	if (numa_balancing_enabled(curr))
> 		task_tick_numa(rq, curr);
> +#endif
>
> 	update_misfit_status(curr, rq);
> 	update_overutilized_status(task_rq(curr));

Surely you can make that #ifdef go away without much effort.
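
For instance (a sketch, not claiming this is the one true way): give
the helper a stub for !CONFIG_NUMA_BALANCING builds,

#ifndef CONFIG_NUMA_BALANCING
static inline bool numa_balancing_enabled(struct task_struct *p)
{
	return false;	/* never balance without the config */
}
#endif

and then the call site needs no #ifdef at all; the compiler throws the
dead branch away:

	if (numa_balancing_enabled(curr))
		task_tick_numa(rq, curr);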

> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8a6432465dc5..11720a35455a 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -59,6 +59,7 @@
> #include <linux/sched/coredump.h>
> #include <linux/sched/task.h>
> #include <linux/sched/cputime.h>
> +#include <linux/sched/numa_balancing.h>
> #include <linux/rcupdate.h>
> #include <linux/uidgid.h>
> #include <linux/cred.h>
> @@ -2101,6 +2102,23 @@ static int prctl_set_auxv(struct mm_struct *mm, unsigned long addr,
> return 0;
> }
>
> +#ifdef CONFIG_NUMA_BALANCING
> +static int prctl_pid_numa_balancing_write(int numa_balancing)
> +{
> +	if (numa_balancing != PR_SET_NUMAB_DEFAULT
> +	    && numa_balancing != PR_SET_NUMAB_DISABLED
> +	    && numa_balancing != PR_SET_NUMAB_ENABLED)
> +		return -EINVAL;

Operators go at the end of the line.
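
I.e. the same check, with trailing operators:

	if (numa_balancing != PR_SET_NUMAB_DEFAULT &&
	    numa_balancing != PR_SET_NUMAB_DISABLED &&
	    numa_balancing != PR_SET_NUMAB_ENABLED)
		return -EINVAL;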

> +	current->mm->numab_enabled = numa_balancing;
> +	return 0;
> +}