Re: [PATCH v11 15/29] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE

From: Thomas Gleixner
Date: Mon Oct 04 2021 - 20:30:48 EST


On Fri, Oct 01 2021 at 15:37, Chang S. Bae wrote:
> arch_prctl(ARCH_SET_STATE_ENABLE, u64 bitmask)
> Some XSTATE features, such as AMX, are unavailable to applications
> until that process explicitly requests them via this call. Requests can
> be made for any number of valid user XSTATEs in a single call. This
> call is intended to be invoked very early in process initialization. A
> forked child inherits access, but permission is reset upon exec. There
> is no concept of un-requesting XSTATE access.
> Return codes:
> 0: success (including repeated calls)
> EINVAL: no hardware feature for the request
>
> arch_prctl(ARCH_GET_STATE_ENABLE, u64 *bitmask)
> Return the bitmask of permitted user XSTATE features. If XSAVE is
> disabled, the bitmask indicates only legacy states.
>
> The permission is checked at every XSTATE buffer expansion: e.g.
> XFD-induced #NM event, and ptracer's XSTATE injection. When no permission
> is found, inform userspace via SIGILL or with error code.
>
> For "dynamic" XSTATE features that have XFD hardware support, the kernel
> can enforce that users can not touch state without permission. For state
> that has no XFD support, the kernel can not prevent a user from touching
> that state.
>
> The notion of granted permission is recorded in the group leader only. A
> new task copies its permission bitmask.

while this patch does again way more than the subject suggests and
really should be split into bits and pieces, the prctl is ill defined
and the implementation is partially buggy.

> + /*
> + * @state_perm:
> + *
> + * This bitmap indicates the permission for state components.
> + *
> + * Always reference group_leader's value via
> + * get_group_state_perm() as it readily represents the process's
> + * state permission.
> + */
> + u64 state_perm;

This want's to be __state_perm to denote that this should not be
accessed directly.

Also the only reason to access this is when a task triggers a permission
check vs. it's own permissions and not unconditionally be deferenced all
over the place.

The point is that you don't want to derefence tsk->group_leader if not
absolutely required. That's why I want to have the information in
fpstate which is thread local and has to be accessed anyway.

Only if tsk->fpstate does not provide the permission then the group
leader state has to be checked, i.e. in #NM and ptrace which is a slow
path anyway. In the case that the permission check on the thread local
info fails then the task is either going to be killed or an extended
buffer has to be allocated.

See?

> +/* Require ARCH_SET_STATE_ENABLE for future features */
> +#define XFEATURE_MASK_PERMISSION_REQUIRED GENMASK_ULL(63, XFEATURE_MAX)

When you add AMX then XFEATURE_MAX is going to be past AMX. And no, even
if you fix that up later in weird ways, this does not make sense.

> +/**
> + * get_group_state_perm - Get a per-process state permission
> + * @tsk: A struct task_struct * pointer
> + * Return: A bitmap to indicate state permission.
> + */
> +static inline u64 get_group_state_perm(struct task_struct *tsk)
> +{
> + return tsk->group_leader->thread.fpu.state_perm;

This needs a READ_ONCE() because it can be concurrently modified and the
read is lockless. Which means that the write side needs a WRITE_ONCE(),
but see below.

> +}
> +
> +/**
> + * state_permitted - Check a task's permission for indicated features.

state_permitted and all the other state names are way too broad. This is
about xstate and not about random states. We have name spaces for a
reason.

> +#define ARCH_SET_STATE_ENABLE 0x1021
> +#define ARCH_GET_STATE_ENABLE 0x1022

This is about XSTATE components and should be named as such,
i.e. something which is entirely clear, e.g. _XCOMP_ because that is
what this is about. More below.

Aside of that why does this not start with the obvious and simple prctl
option to retrieve the possible supported features?

> +/**
> + * reset_state_perm - Reset a task's permission for dynamic user state
> + *
> + * It is expected to call at exec in which one task runs in a process.
> + *
> + * @task: A struct task_struct * pointer
> + */
> +void reset_state_perm(struct task_struct *tsk)
> +{
> + struct fpu *fpu = &tsk->thread.fpu;
> +
> + fpu->state_perm = xfeatures_mask_all & ~xfeatures_mask_user_perm();
> +
> + if (!xfeatures_mask_user_dynamic ||
> + !(fpu->state_mask & xfeatures_mask_user_dynamic))
> + return;
> +
> + WARN_ON(tsk->signal->nr_threads > 1);

Why? The only two callers are from set_personality*().

Aside of that why are you doing this from set_personality()?

This has absolutely nothing to do with set_personality() which is
exclusively about native and compat format of the executable.

At the point where set_personality() is invoked the task which does
exec() has already invoked flush_thread(), which invokes
fpu_flush_thread() which in turn resets the FPU state...

So what?

> +/**
> + * do_arch_prctl_state - Read or write the state permission.
> + * @fpu: A struct task_struct * pointer

fpu != tsk

Also yuck, that task argument is silly because task cannot be anything
else than current, but that's not your fault.

> + * @option: A subfunction of arch_prctl()
> + * @arg2: Either a pointer to userspace memory or state-component
> + * bitmap value.
> + * Return: 0 if successful; otherwise, return a relevant error code.
> + */
> +long do_arch_prctl_state(struct task_struct *tsk, int option, unsigned long arg2)
> +{
> + u64 features_mask;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_FPU))
> + features_mask = 0;
> + else if (use_fxsr())
> + features_mask = XFEATURE_MASK_FPSSE;
> + else
> + features_mask = XFEATURE_MASK_FP;

Why? This feature mask should be evaluated once and not over and over again.

> + switch (option) {
> + case ARCH_SET_STATE_ENABLE: {
> + u64 state_perm = arg2;

This does nowhere mention in the comments above that this limits the
available state space to 32bit for 32bit tasks running on a 64 bit
kernel. I don't think it's a problem, but it has to be documented.

> +
> + if (use_xsave())
> + features_mask = xfeatures_mask_uabi();
> +
> + if (state_perm & ~features_mask)
> + return -EINVAL;
> +
> + state_perm &= xfeatures_mask_user_perm();
> + if (!state_perm)
> + return 0;

I really do not get the semantics of this prctl at all.

GET stores _all_ permitted bits in the user space variable which makes
sense.

SET is just accepting everything except not supported bits, but as it
takes a feature bitmask it suggests that this sets the xfeature bits
which are available for the task or the process.

How does prctl(..., SET, 0) make sense?

It does not make any sense at all. There is no support for downgrading
the permitted features:

1) Default features up to AVX512 cannot be disabled

2) Once AMX (or any upcoming state) is enabled there is not way back.

So no. This really want's to be

prctl(SET, xfeature_number)

and not something which is semanticaly ill defined.

> + tsk->group_leader->thread.fpu.state_perm |= state_perm;

While this "works" with a single permission controlled state this is
completely broken once more permission controlled states come into play
when tasks of the same process invoke this concurrently and request
different features.

> + return 0;
> + }
> + case ARCH_GET_STATE_ENABLE: {
> + if (use_xsave())
> + features_mask = get_group_state_perm(tsk);
> +
> + return put_user(features_mask, (unsigned long __user *)arg2);

This is broken for 32bit kernels. The prctl is unconditional and
therefore this needs to be a *u64 cast.

Thanks,

tglx