Re: [RFC PATCH v1] sched/uclamp: Introduce SCHED_FLAG_RESET_UCLAMP_ON_FORK flag

From: Qais Yousef
Date: Thu Apr 20 2023 - 09:39:06 EST


On 04/19/23 15:49, Saravana Kannan wrote:
> On Wed, Apr 19, 2023 at 10:54 AM Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> >
> > Hi David!
> >
> > On 04/16/23 14:34, David Dai wrote:
> > > A userspace service may manage uclamp dynamically for individual tasks and
> > > a child task will unintentionally inherit a pseudo-random uclamp setting.
> > > This could result in the child task being stuck with a static uclamp value
> > > that results in poor performance or poor power.
> > >
> > > Using SCHED_FLAG_RESET_ON_FORK is too coarse for this usecase and will
> > > reset other useful scheduler attributes. Adding a
> > > SCHED_FLAG_RESET_UCLAMP_ON_FORK will allow userspace to have finer control
> > > over scheduler attributes of child processes.
> >
> > Thanks a lot for the patch. This has been a known limitation for a while but
> > I didn't manage to find the time to push anything yet.
> >
> > ADPF (Android Dynamic Performance Framework) exposes APIs to manage performance
> > for a set of pids [1]. Only these tasks belong to the session and any forked
> > task is expected to have its uclamp values reset. But as you pointed out, the
> > current RESET_ON_FORK resets everything, but we don't want that as these
> > attributes don't belong to ADPF to decide whether they should be reset too or
> > not. And not resetting them means we can end up with tasks inheriting random
> > uclamp values unintentionally. We can't tell these tasks not to fork anything.
> > If the forked tasks are expected to be part of the session, then their pids
> > must be added explicitly.
> >
> > [1] https://developer.android.com/reference/android/os/PerformanceHintManager#createHintSession(int%5B%5D,%20long)
> >
> > >
> > > Cc: Qais Yousef <qyousef@xxxxxxxxxx>
> > > Cc: Quentin Perret <qperret@xxxxxxxxxx>
> > > Cc: Saravana Kannan <saravanak@xxxxxxxxxx>
> > > Signed-off-by: David Dai <davidai@xxxxxxxxxx>
> > > ---
> > > include/linux/sched.h | 3 +++
> > > include/uapi/linux/sched.h | 4 +++-
> > > kernel/sched/core.c | 6 +++++-
> > > tools/include/uapi/linux/sched.h | 4 +++-
> > > 4 files changed, 14 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index 63d242164b1a..b1676b9381f9 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -885,6 +885,9 @@ struct task_struct {
> > > unsigned sched_reset_on_fork:1;
> >
> > nit: can't we convert to a flag and re-use?
> >
> > > unsigned sched_contributes_to_load:1;
> > > unsigned sched_migrated:1;
> > > +#ifdef CONFIG_UCLAMP_TASK
> > > + unsigned sched_reset_uclamp_on_fork:1;
> > > +#endif
> > >
> > > /* Force alignment to the next boundary: */
> > > unsigned :0;
> > > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > > index 3bac0a8ceab2..7515106e1f1a 100644
> > > --- a/include/uapi/linux/sched.h
> > > +++ b/include/uapi/linux/sched.h
> > > @@ -132,12 +132,14 @@ struct clone_args {
> > > #define SCHED_FLAG_KEEP_PARAMS 0x10
> > > #define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
> > > #define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
> > > +#define SCHED_FLAG_RESET_UCLAMP_ON_FORK 0x80
> > >
> > > #define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
> > > SCHED_FLAG_KEEP_PARAMS)
> > >
> > > #define SCHED_FLAG_UTIL_CLAMP (SCHED_FLAG_UTIL_CLAMP_MIN | \
> > > - SCHED_FLAG_UTIL_CLAMP_MAX)
> > > + SCHED_FLAG_UTIL_CLAMP_MAX | \
> > > + SCHED_FLAG_RESET_UCLAMP_ON_FORK)
> >
> > I was considering something a bit more generic that allows selecting
> > which attributes to reset.
> >
> > For example a syscall with SCHED_FLAG_RESET_ON_FORK_SEL combined with
> > SCHED_FLAG_UTIL_CLAMP_MIN/MAX will only reset those. This should make it
> > extensible if we have other similar use cases in the future. The downside is
> > that it *might* need to be done in a separate syscall from the one that sets
> > these parameters. But it should only need to be done once.
>
> In addition to the downside you mentioned, I'm not a huge fan of this
> suggestion since the meaning of the SCHED_FLAG_RESET_ON_FORK_SEL flag
> changes based on what other flags or attrs are set. I'd rather we have
> explicit flags.

The concern is that these flags are a limited resource. latency_nice is
hopefully coming, and I don't see uclamp as an exception that warrants its own
unique reset flag. Do you think we will never face a similar exception again?

>
> SCHED_FLAG_RESET_ON_FORK_SEL makes it harder to maintain the userspace
> code/makes it easy to accidentally introduce bugs. For example, a
> syscall could be setting UCLAMP_MIN and RESET_ON_FORK_SEL. Someone
> else might come and change the call to also set a nice value but not
> remember to split it up into two calls. Whereas with an explicit flag
> like David's proposal, we won't hit such an issue.

I think this failure mode already exists today and is not new. You have to
remember to set the right flag to keep the policy etc., otherwise you can end up
with accidental effects.

That was the first suggestion that came to mind; it could be done in other ways,
I suppose.

> Also, we'll need to have separate flags internally to track what needs
> to be reset on fork vs not. So we really aren't saving anything by
> adding RESET_ON_FORK_SEL.

I don't get you here. Do you mean we'll have to track this in the kernel or in
userspace? I presume the former. It's just setting a flag in the reset_on_fork
variable. I don't see the problem.
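
To make that concrete, here is a rough userspace model of what I have in mind.
It is only a sketch, not kernel code. SCHED_FLAG_RESET_ON_FORK_SEL and its 0x80
value, the RESET_* bits and reset_mask_from_flags() are all made up for
illustration; the other SCHED_FLAG_* values are the existing uapi ones.

```c
#include <stdint.h>

/* Existing uapi values from include/uapi/linux/sched.h. */
#define SCHED_FLAG_RESET_ON_FORK	0x01
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40

/* HYPOTHETICAL: the selector flag floated in this thread; value is made up. */
#define SCHED_FLAG_RESET_ON_FORK_SEL	0x80

/* HYPOTHETICAL: bits of a per-task reset_on_fork mask replacing the 1-bit field. */
#define RESET_ALL		0x1
#define RESET_UCLAMP_MIN	0x2
#define RESET_UCLAMP_MAX	0x4

/* Translate sched_attr::sched_flags into the per-task reset mask. */
static unsigned int reset_mask_from_flags(uint64_t sched_flags)
{
	unsigned int mask = 0;

	if (sched_flags & SCHED_FLAG_RESET_ON_FORK)
		mask |= RESET_ALL;

	/* The selector only picks up the attributes named in the same call. */
	if (sched_flags & SCHED_FLAG_RESET_ON_FORK_SEL) {
		if (sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
			mask |= RESET_UCLAMP_MIN;
		if (sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
			mask |= RESET_UCLAMP_MAX;
	}

	return mask;
}
```

A future attribute would then cost one more bit in the internal mask rather
than one more uapi flag.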

If preserving the flag space is not a concern, then yeah, potentially this is
okay. Though in principle I don't think it makes sense to keep adding a new
flag for every potential similar exception. History tends to repeat itself.
I'm okay with keeping it simple if the maintainers don't share the concern
about the flag space.

user_check_sched_setscheduler() prevents unprivileged users from clearing
reset_on_fork. Shouldn't we do the same?
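
Roughly, the rule I'd expect for the new bit looks like the sketch below. This
is a userspace model only; may_change_reset_uclamp() and its parameters are
hypothetical names, and "privileged" stands in for the kernel's capability
check (CAP_SYS_NICE, if I remember correctly).

```c
#include <errno.h>
#include <stdbool.h>

/*
 * HYPOTHETICAL helper mirroring the reset_on_fork rule: anyone may set
 * the flag, but only a privileged caller may clear it once it is set.
 */
static int may_change_reset_uclamp(bool currently_set, bool requested,
				   bool privileged)
{
	/* Clearing an already-set flag is the only restricted transition. */
	if (currently_set && !requested && !privileged)
		return -EPERM;

	return 0;
}
```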

Also, we should make sure to clear it after the fork, as is done for
reset_on_fork.
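
Something along these lines, as a userspace model rather than a patch. struct
task_model and fork_model() are made up for illustration; the defaults of 0 for
the min clamp and SCHED_CAPACITY_SCALE (1024) for the max clamp are what I'd
expect for a normal task, ignoring the RT default handling.

```c
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE	1024

/* HYPOTHETICAL: the few per-task fields this sketch cares about. */
struct task_model {
	unsigned int uclamp_min;
	unsigned int uclamp_max;
	bool reset_uclamp_on_fork;
};

/* Model of the fork path: the child starts as a copy of the parent. */
static struct task_model fork_model(const struct task_model *parent)
{
	struct task_model child = *parent;

	if (child.reset_uclamp_on_fork) {
		/* Drop the inherited clamps back to the defaults. */
		child.uclamp_min = 0;
		child.uclamp_max = SCHED_CAPACITY_SCALE;
		/* Clear the flag so the request does not propagate further. */
		child.reset_uclamp_on_fork = false;
	}

	return child;
}
```

The parent keeps both its clamps and its flag; only the child is touched.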


Cheers

--
Qais Yousef