Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

From: Andy Lutomirski
Date: Wed Oct 14 2020 - 12:10:40 EST


On Tue, Oct 13, 2020 at 11:03 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> On 10/13/20 6:11 PM, Andy Lutomirski wrote:
> > I have no problem with vmalloc(), but I do have a problem with
> > vfree() due to the IPIs that result. We need a cache or something.
>
> This sounds like the kind of thing we should just build into vmalloc()
> instead of having a bunch of callers implement their own caches. It
> shouldn't be too much of a challenge to have vmalloc() keep a cacheline
> or two of stats about common vmalloc() sizes and keep some pools around.
>
> It's not going to be hard to implement caches to reduce vfree()-induced
> churn, but I'm having a hard time imaging that it'll have anywhere near
> the benefits that it did for stacks. Tasks fundamentally come and go a
> *lot*, and those paths are hot.
>
> Tasks who go to the trouble to populate 8k or 64k of register state
> fundamentally *can't* come and go frequently. We also (probably) don't
> have to worry about AMX tasks doing fork()/exec() pairs and putting
> pressure on vmalloc(). Before an app can call out to library functions
> to do the fork, they've got to save the state off themselves and likely
> get it back to the init state. The fork() code can tell AMX is in the
> init state and decline to allocate actual space for the child.
>
> > I have to say: this mechanism is awful. Can we get away with skipping
> > the dynamic XSAVES mess entirely? What if we instead allocate
> > however much space we need as an array of pages and have one percpu
> > contiguous region. To save, we XSAVE(S or C) just the AMX state to
> > the percpu area and then copy it. To restore, we do the inverse. Or
> > would this kill the modified optimization and thus be horrible?
>
> Actually, I think the modified optimization would survive such a scheme:
>
> * copy page array into percpu area
> * XRSTORS from percpu area, modified optimization tuple is saved
> * run userspace
> * XSAVES back to percpu area. tuple matches, modified optimization
> is still in play
> * copy percpu area back to page array
>
> Since the XRSTORS->XSAVES pair is both done to the percpu area, the
> XSAVE tracking hardware never knows it isn't working on the "canonical"
> buffer (the page array).

I was suggesting something a little bit different. We'd keep XMM,
YMM, ZMM, etc state stored exactly the way we do now and, for
AMX-using tasks, we would save the AMX state in an entirely separate
buffer. This way the pain of having a variable xstate layout is
confined just to AMX tasks.

I'm okay with vmalloc() too, but I do think we need to deal with the
various corner cases like allocation failing.