Re: [PATCH v3 07/21] x86/fpu/xstate: Introduce helpers to manage dynamic xstate buffers

From: Bae, Chang Seok
Date: Tue Feb 02 2021 - 23:11:34 EST


On Jan 27, 2021, at 02:41, Borislav Petkov <bp@xxxxxxx> wrote:
> On Wed, Jan 27, 2021 at 01:23:57AM +0000, Bae, Chang Seok wrote:
>> The xstate buffer may expand on the fly. The size has to be correctly
>> calculated if needed. CPUID provides essential information for the
>> calculation. Instead of reading CPUID repeatedly, store them -- the offset and
>> size are already stored here. The 64B alignment looks to be missing, so added
>> here.
>
> /me goes and digs into the SDM.
>
> Do you mean this:
>
> "Bit 01 is set if, when the compacted format of an XSAVE area is used,
> this extended state component located on the next 64-byte boundary
> following the preceding state component (otherwise, it is located
> immediately following the preceding state component)."
>
> So judging by your variable naming, you wanna record here whether the
> buffer aligns on 64 bytes.
>
> Yes, no?

Yes, you’re right.

> How about a comment over that variable so that people reading the code,
> know what it records and do not have to open the SDM each time.

Okay, how about:

This alignment bit is set if the state is saved on a 64B-aligned address in
the compacted format buffer.
"

>> Maybe:
>> "When the buffer is more than this size, the current mechanism is
>> potentially marginal to support the allocations."
>
> Where do you get those formulations?!
>
> Are you simply trying to say that for buffers larger than 64K, the
> kernel needs "a more sophisticated allocation scheme"?
>
> I'd suggest you try simple formulations first.
>
> And why does it need a more sophisticated allocation scheme? Is 64K
> magical?
>
> Also, I'm assuming here - since you're using vmalloc - that XSAVE* can
> handle virtually contiguous memory. SDM says it saves to "mem" and
> doesn't specify so it sounds like it does but let's have a confirmation
> here pls.

Yes, correct.

>>>> +#define XSTATE_BUFFER_MAX_BYTES (64 * 1024)
>>>
>>> What's that thing for when we have fpu_kernel_xstate_max_size too?
>>
>> The threshold size is what the current mechanism can comfortably allocate
>> (maybe at most). The warning is left when the buffer size goes beyond the
>> threshold. Then, we may need to consider a better allocation mechanism.
>
> As above, why?
>
>> Although a warning is given, vmalloc() may manage to allocate this size. So,
>> it was not considered a hard hit yet. vmalloc() failure will return an error
>> later.
>
> And that warning is destined for whom, exactly?
>
> When can that state become more than 64K?
>
> What is that artificial limit for?
>
> A whole lot of questions…

Okay, let me try to explain..

The threshold here could be more than that. But the intention is a heads-up to
(re-)consider (a) a new allocation mechanism and (b) to shrink the memory
allocation.

Also, the AMX state size is limited to (a bit less than) 64KB and it was
discussed that vmalloc() will be okay with AMX [2].

DaveH, correct me if I'm wrong.

>>>> + state_ptr = vmalloc(newsz);
>>>> + if (!state_ptr) {
>>>> + trace_x86_fpu_xstate_alloc_failed(fpu);
>>>
>>> WTH is that tracepoint here for?
>>
>> While it returns an error, this function can be on the path of NMI handling.
>
> How?
>
> You're allocating an xstate buffer in NMI context?!

Oh, sorry. The typo could make it confusing here -- s/NMI/#NM/.

>> Then, likely only with the “unexpected #NM exception” message. So, logging a
>> tracepoint can provide evidence of the allocation failure in that case.
>
> Who's going to see that tracepoint, people who are tracing the system
> but not normal users.

Maybe it is possible to backtrack this allocation failure out of #NM handling.
But the tracepoint can provide a clear context, although limited to those
using it.

>> PATCH9 introduces a wrapper that determines which to take. It simply returns
>> state_ptr when not a null pointer. So, the logic is to use the dynamic buffer
>> when available.
>
> Why not allocate the xstate buffer by default instead of being embedded
> in struct fpu?

Indeed, this is the most preferred way on one hand. But there was a change to
the current allocation approach by Ingo about 6 years ago [3].

So, I’m wondering his current thought on this suggestion.

> You're already determining its max_size and you can use that to do the
> allocation. Two buffers is calling for trouble.

But if so, every task will consume 8KB (or up to 64KB) with AMX. Bad is the
waste of memory for those not using the state at all.

>> [1] https://lore.kernel.org/lkml/69721125-4e1c-ca9c-ff59-8e1331933e6c@xxxxxxxxx/#t
>
> Ok, I read that subthread.
>
> The reasoning *why* we're using vmalloc() needs to be explained in a
> comment over alloc_xstate_buffer() otherwise we will forget and that is
> important.

Maybe:

If the task with vmalloc()-allocated buffer tends to terminate quickly,
vfree()-induced IPIs may be a concern. Implement cache may be helpful on this.
But the task with large state is likely to live longer. So, use vmalloc()
simply.
"
Let me know if this is not enough.

Thanks,
Chang

[2] https://lore.kernel.org/lkml/CALCETrW8u5rUsZvoo5t4YtC+4boBVcK__-srtA1+-YX06QYD1w@xxxxxxxxxxxxxx/
[3] https://lore.kernel.org/lkml/1430848300-27877-56-git-send-email-mingo@xxxxxxxxxx/