Re: [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE

From: Thiago Macieira
Date: Mon Aug 09 2021 - 19:42:41 EST


On Monday, 9 August 2021 15:08:19 PDT Bae, Chang Seok wrote:
> I suspect the EBUSY situation is somewhat imaginative. In theory, the
> situation might be possible one thread calls this syscall at some point when
> a new task is being created -- after task data duplication [1] and before
> enlisted [2].
>
> As stated in the changelog, the alternative is possible:
> > An alternative implementation would not save the permission bitmap in
> > every task. But instead would extend the per-process signal data, and
> > that would not be subject to this race.
>
> But it involves quite a bit of code complexity and this is pretty much
> backend. I think it is possible to follow up and update when the case ever
> turns out to be real. At least, I'm not aware of any report against the
> PR_SET_FP_MODE prctl(2) [3] which took the same way -- walk and update the
> task list.
>
> Perhaps, the hunk above can be improved to be atomic.
>
> <snip>

Hello Chang

Thanks for taking a look at this. I agree that this is a very, very tiny
corner case and the issue can be treated as a bugfix later. The API between
userspace and kernel is fine, which is the big issue right now.

But explaining what the issue I see is: a userspace library cannot enforce
that other threads in the same process aren't either making this same system
call or starting/exiting threads. So I see two scenarios where the corner case
can be reached:

1) two simultaneous ARCH_SET_STATE_ENABLE calls
Imagine two threads, each setting a different bit (say bits 18 and 19). Since
they race with each other and this line:
t->thread.fpu.dynamic_state_perm |= req_dynstate_perm;
is not using an atomic, the compiler won't emit LOCK OR, so it's possible the
two calls will step over each other and partially undo the other's work. The
result after the two calls is completely indeterminate, yet both functions
returned success.

Since right now there's only one bit that can be set, we know that the two
calls are doing the same thing, so they're not effectively racing each other.
So this case is not an issue *right* *now*. There's only duplicate work.

2) one thread calls ARCH_SET_STATE_ENABLE while another thread exits/starts
In this case, the nr_threads != tsk->signal->nr_threads test will fail
resulting in the dynamic state to be rolled back and the EBUSY condition. I'd
like a recommendation from the kernel on how to deal with that signal: should
I retry? For now, I'm going to treat EBUSY like EINVAL and assume I cannot use
the feature.

1+2) both situations at the same time
This means the corruption can get worse since the rollback code can undo or
partially undo the progression of the other ARCH_SET_STATE_ENABLE.

> > So I have to insist that the XGETBV instruction's result match exactly
> > what is permitted to run. That means we either enable AMX unconditionally
> > with no need for system calls (with or without XFD trapping to
> > dynamically allocate more state), or that the XCR0 register be set
> > without the AMX bits by default, until the system call is issued.
>
> XCR0 provokes VMEXIT which will impact the performance hardly. At least the
> opt-in model is a consensus out of the long debate [4]. Let alone the
> question on how well advertise this new syscall though.

I understand.

I am pointing out that this will cause crashes because of improperly /
insufficiently-tested software. That is, software that violates the contract
of the new API because we inserted a new requirement that didn't exist for old
features. Yes, said software is buggy.

The problem is that the crashes can be surprising and will only show up after
the software gets run on an AMX-capable machine. That may happen, for example,
if a cloud provider "upgrades" the instance of a VM from a previous generation
of processor to a new one, or if a batch job does include the new instance
type. That is, the crashes will not happen for the developer of the software
in question, but instead for the user.

However, given the requirements that:
a) XCR0 not be context-switched
b) a new API call be required to allow the new instructions

Then there's no alternative.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel DPG Cloud Engineering