Re: [PATCH 1/1] x86/elf: Add a new .note section containing Xfeatures information to x86 core files

From: John Baldwin
Date: Thu Mar 14 2024 - 13:37:02 EST


On 3/14/24 10:10 AM, Dave Hansen wrote:
On 3/14/24 09:45, John Baldwin wrote:
On 3/14/24 8:37 AM, Dave Hansen wrote:
On 3/14/24 04:23, Vignesh Balasubramanian wrote:
Add a new .note section containing type, size, offset and flags of
every xfeature that is present.

Mechanically, I'd much rather have all of that info in the cover letter
in the actual changelog instead.

I'd also love to see a practical example of what an actual example core
dump looks like on two conflicting systems:

    * Total XSAVE size
    * XCR0 value
    * XSTATE_BV from the core dump
    * XFEATURE offsets for each feature

I noticed this when I bought an AMD Ryzen 9 5900X based system for
my desktop running FreeBSD and found that the XSAVE core dump notes
were not recognized by GDB (FreeBSD dumps an XSAVE register set note
that matches the same layout of NT_X86_XSTATE used by Linux).

I just want to make sure that you heard what I asked. I'd like to see a
practical example of how the real-world enumeration changes between two
real world systems.

Is that possible?

Here's the raw CPUID data from the XSAVE region on my laptop:

0x0000000d 0x00: eax=0x000002e7 ebx=0x00000a88 ecx=0x00000a88 edx=0x00000000
0x0000000d 0x01: eax=0x0000000f ebx=0x00000998 ecx=0x00003900 edx=0x00000000
0x0000000d 0x02: eax=0x00000100 ebx=0x00000240 ecx=0x00000000 edx=0x00000000
0x0000000d 0x05: eax=0x00000040 ebx=0x00000440 ecx=0x00000000 edx=0x00000000
0x0000000d 0x06: eax=0x00000200 ebx=0x00000480 ecx=0x00000000 edx=0x00000000
0x0000000d 0x07: eax=0x00000400 ebx=0x00000680 ecx=0x00000000 edx=0x00000000
0x0000000d 0x08: eax=0x00000080 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
0x0000000d 0x09: eax=0x00000008 ebx=0x00000a80 ecx=0x00000000 edx=0x00000000
0x0000000d 0x0b: eax=0x00000010 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
0x0000000d 0x0c: eax=0x00000018 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
0x0000000d 0x0d: eax=0x00000008 ebx=0x00000000 ecx=0x00000001 edx=0x00000000

Could we get that for an impacted AMD system, please?

cpuid -1 --raw | grep " 0x0000000d "

should do it.

0x0000000d 0x00: eax=0x00000207 ebx=0x00000988 ecx=0x00000988 edx=0x00000000
0x0000000d 0x01: eax=0x0000000f ebx=0x00000348 ecx=0x00001800 edx=0x00000000
0x0000000d 0x02: eax=0x00000100 ebx=0x00000240 ecx=0x00000000 edx=0x00000000
0x0000000d 0x09: eax=0x00000008 ebx=0x00000980 ecx=0x00000000 edx=0x00000000
0x0000000d 0x0b: eax=0x00000010 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
0x0000000d 0x0c: eax=0x00000018 ebx=0x00000000 ecx=0x00000001 edx=0x00000000

Here, I think the ebx value for the 0x09 leaf (PKRU) is the relevant difference
here, it is 0xa80 on your laptop and 0x980 on the AMD CPU. (This is the
missing MPX gap on AMD.)

This is pretty close to just a raw dump of the XSAVE CPUID leaves.
Rather than come up with an XSAVE-specific ABI that depends on CPUID
*ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
should just bite the bullet and dump out (some of) the raw CPUID space.

So the current note I initially proposed and implemented for FreeBSD
(https://reviews.freebsd.org/D42136) and an initial patch set for GDB
(https://sourceware.org/pipermail/gdb-patches/2023-October/203083.html)
do indeed dump a raw set of CPUID leaves.  The version I have for FreeBSD
only dumps the raw leaf values for leaf 0x0d though the note format is
extensible should additional leaves be needed in the future.  One of the
questions if we wanted to use a CPUID leaf note is which leaves to dump
(e.g. do you dump all of them, or do you just dump the subset that is
currently needed).

You dump what is needed and add to the dump over time.

That is what I started with, yes, but am attempting to anticipate future
problems in my list of caveats.

Another quirky question is what to do about systems with hetergeneous
cores (E vs P for example).
That's irrelevant for now. The cores may be heterogeneous but the
userspace ISA and (and thus XSAVE formats) are identical. If they're
not, then we have bigger problems on our hands.

Yes, I agree on the bigger problems and hope we don't have to solve
them.

Currently those systems use the same XSAVE layout across all cores,
but other CPUID leaves do already vary across cores on those systems.

There shouldn't be any CPUID leaves that differ _and_ matter to
userspace and thus core dumps.

Today that is true, yes. I'm fine with making that tradeoff (along
with only dumping a subset of leaves) so long as the consensus is that
is an acceptable tradeoff to make.

However, there are other wrinkles with the leaf approach.  Namely, one
of the use cases that I currently have an ugly hack for in GDB is if
you are using gdb against a remote host running gdbserver and then use
'gcore' to generate a core dump.  GDB needs to write out a NT_X86_XSTATE
note, but that note requires a layout.  What GDB does today is just pick
a known Intel layout based on the XCR0 mask.  However, GDB should ideally
start writing out whatever new note we adopt here, so if we dump raw
CPUID leaves it means extending the GDB remote protocol so we can query
the CPUID leaves from the remote host.  On the other hand, if we choose a
more abstract format as proposed in this patch, the local GDB (or LLDB
or whatever) can generate whatever synthetic layout it wants to write
the local NT_X86_XSTATE.  (NB: A relevant detail here is that the GDB
remote protocol does not pass the entire XSAVE state across as a block,
instead gdbserver parses individual register values for AVX, etc.
registers and those decoded register values are passed over the
protocol.)

So the gdb side says, "Give me PKRU" and the remote side parses the
XSAVE image, finds PKRU, and sends it over the wire?

Yes.

Another question is potentially supporting compact XSAVE format in
for NT_X86_XSTATE.  Today Linux has some complicated code to re-expand
the compat XSAVE format back out to the standard layout for ptrace() and
process core dumps.

Yeah, but supporting the compacted format in NT_X86_XSTATE doesn't help
us at all. We still intermingle user and supervisor state and that
needs to get repacked _anyway_.

Fair enough.

In other words, no matter what we do, it's going to be complicated
because the userspace buffer can't have supervisor state and the kernel
buffer does have it. The compacted format mismatch is the least of our
problems.

  (FreeBSD doesn't yet make use of XSAVEC so we
haven't yet dealt with that problem.)

... or XSAVES, which is actually the most relevant here.

Backing up... there are two approaches here:

1. Dump out raw x86-specific gunk, aka. CPUID contents itself. There
are a billion ways to do this and lots of complications, including
the remote protocol implications
or
2. Define an abstract format that works anywhere, not just on x86 and
not just for XSAVE.

There's no (sane) middle ground. The implementation here (in this
patch) is fundamentally x86-specific and pretends to be some kind of
abstracted x86-independent format.

Well, are there other register notes that could benefit from an approach
like this? Most other register notes I'm aware of on various architectures
either have a fixed layout (like the typical general purpose register notes),
or they have a fixed set of registers but the size of individual registers
can vary (thinks like SME or RISC-V's vector extension). XSAVE is the only
one I'm aware of that packs multiple register sets into a single note.

To step back a bit, another approach that could be taken (and I'm not sure
it is worth it at this point) would to stop dumping a single XSAVE note
and dump a separate register note for each feature. That is, dump a note
for AVX (the upper bits of ymmX), a note for PKRU, etc. I think if I had
to pick a strategy at the very beginning that's what I would choose now,
but this isn't the very beginning and that sort of change is likely too
disruptive. (This approach is what happens on other arches today in effect,
e.g. on AArch64.)

--
John Baldwin