Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread

From: Keno Fischer
Date: Tue Apr 14 2020 - 20:10:42 EST


> (why would you want to record with an unusual XCR0?)

I was indeed primarily thinking about the replay case when I
originally wrote this patch, but the reason for that is that the machine
that I tend to replay on is fairly new, so the traces we get tend
to be compatible. I have also encountered the opposite situation,
where I wanted to send somebody a trace to debug, but
their machine didn't support AVX512, so I wanted to disable it
while recording.
A while ago we also encountered a similar situation where
PKRU became available on AWS so we couldn't
replay traces from there anymore.
We do of course have CPUID faulting here (which well-behaved
userspace software tends to respect), but the differing return
value from xgetbv messed up the recording nonetheless. In
order to actually make that work, masking those bits out from
XCR0 would have been required. I'm also thinking about the
future where there may be more diversity in XCR0 user state
components and what chips support what.
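
To make the masking concrete, here is a minimal sketch of the bit
arithmetic involved. The bit positions are the XCR0 state-component
bits from the Intel SDM; the helper names (xcr0_without,
xcr0_can_restrict) are hypothetical and only for illustration, they
are not part of the patch or of rr. It only computes candidate values
and does not execute xgetbv or change any CPU state.

```c
#include <stdbool.h>
#include <stdint.h>

/* XCR0 state-component bits (Intel SDM). */
#define XCR0_X87        (1ULL << 0)
#define XCR0_SSE        (1ULL << 1)
#define XCR0_AVX        (1ULL << 2)
#define XCR0_BNDREGS    (1ULL << 3)  /* MPX */
#define XCR0_BNDCSR     (1ULL << 4)  /* MPX */
#define XCR0_OPMASK     (1ULL << 5)  /* AVX-512 k0-k7 */
#define XCR0_ZMM_HI256  (1ULL << 6)  /* AVX-512 upper halves of ZMM0-15 */
#define XCR0_HI16_ZMM   (1ULL << 7)  /* AVX-512 ZMM16-31 */
#define XCR0_PKRU       (1ULL << 9)

/* The three AVX-512 components have to be enabled or disabled together. */
#define XCR0_AVX512 (XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM)

/*
 * Hypothetical helper: the XCR0 value to record with if `features`
 * should be hidden from the tracee. It does not validate the other
 * architectural constraints (x87 always set, AVX requires SSE, ...).
 */
static uint64_t xcr0_without(uint64_t xcr0, uint64_t features)
{
    return xcr0 & ~features;
}

/*
 * Hypothetical helper: a machine whose XCR0 is `host` can only be
 * restricted to `wanted` if `wanted` enables nothing the host lacks.
 */
static bool xcr0_can_restrict(uint64_t host, uint64_t wanted)
{
    return (wanted & ~host) == 0;
}
```

For example, a recording machine with x87, SSE, AVX, AVX-512 and PKRU
enabled would report 0x2e7 from xgetbv; masking out AVX-512 and PKRU
leaves 0x7, which an AVX-only replay machine can match.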

I assume the objection here will be that we can have our users
reboot with a different kernel command line, which is true enough
and what I will recommend to people for the time being. That said,
part of the appeal for users of rr is that it doesn't require privilege,
so it works in locked down environments like HPC clusters where
users can't easily change the kernel parameters (even KVM can
be an ask here - but is in my experience easier to negotiate with
the sysadmins).

> and replay would use KVM.

Yes, there is still a bit of state management left (e.g. we currently
rely on the kernel for mmap, fork/clone semantics and signal
delivery for deterministic signals) - though of course emulating
all that is much easier than the record side.

> I'm not sure about diversions.

Diversions (terminology note for those not familiar with record
and replay systems - a diversion is a replay that gets turned
back into a real process for the purpose of investigating the
state - think invoking a printing function from the debugger)
are similar to record in that they do handle syscalls (e.g.
for open/read/write for printing, mmap if the process allocates
during a diversion, etc.). We do have more performance
leeway for diversions than we do for recording, so we might
be okay without getting too deep into KVM hacking.

I think my overall take-away is that KVM is probably feasible
for replay and diversions, but would require large patches
to KVM to be feasible for record. Unfortunately, that's
also the opposite of where getting a patch like this
into the mainline kernel would be most useful for us.
We do tend to have control over the replay machines,
so if we need to carry a patch like this, it's feasible (as
indeed we have been doing for the past two years), though
of course we'd like to enable this also for people who
don't want to build their own kernel (as well as for people
for whom that assumption doesn't hold - the assumption that
we control the replay machine is probably quite specific to
how we in particular use this technology).

Hope that helps,
Keno