Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall

From: Thomas Gleixner
Date: Fri Oct 01 2021 - 05:56:43 EST


On Thu, Sep 30 2021 at 21:41, Andy Lutomirski wrote:
> On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:
>> All of this is solved already otherwise CPU hot unplug would explode in
>> your face every time. The software IPI send side is carefully
>> synchronized vs. hotplug (at least in theory). May I ask you politely to
>> make yourself familiar with all that before touting "We do need..." based
>> on random assumptions?
>
> I'm aware that the software send IPI side is synchronized against
> hotplug. But SENDUIPI is not unless we're going to have the CPU
> offline code IPI every other CPU to make sure that their SENDUIPIs
> have completed -- we don't control the SENDUIPI code.

That's correct, but on CPU hot unplug _all_ running tasks have been
migrated to an online CPU _before_ the APIC is turned off. So they all
went through schedule() which set the UPID->SN bit. That's obviously
racy, but that has to be handled in exit to user mode anyway because
that's not different from any other migration or preemption. So that's
_not_ a problem at all.

The problem only exists if we can't do the #GP trick for tasks which are
sitting in uintr_wait(). Because then we _have_ to be careful vs. a
concurrent SENDUPI. But that'd be not any different from the problem
vs. device interrupts which we have today.

If we can use #GP then there is no problem at all and we avoid all the
nasty stuff vs. hotplug and avoid the list walk etc.

> After reading the ISE docs again, I think it might be possible to use
> the ON bit to synchronize. In the schedule-out path, if we discover
> that ON = 1, then there is an IPI in flight to us. In theory, we
> could wait for it, although actually doing so could be a mess. That's
> why I'm asking whether there's a way to tell the APIC to literally
> wait for all IPIs that are *already sent* to be delivered.

You could busy poll with interrupts enabled, but that does not solve
anything. What guarantees that after APIC.IRR is clear no further IPI is
sent? Nothing at all. But again, that's not any different from device
interrupts and we already handle that today:

cpu down()
...
disable interrupts();
for_each_interrupt_affine_to_cpu(irq) {
change_affinity_to_online_cpu(irq, new_target_cpu);
// Did device send to the old vector?
if (APIC.IRR & vector_bit(old_vector))
send_IPI(new_target_cpu, new_vector);
}

So for uintr_wait() w/o #GP we'd need to do:

for_each_waiter_on_cpu(w) {
move_waiter_to_new_target_cpu_wait_list(w);
w->ndest = new_target_cpu;
if (w->ON)
send_IPI(new_target_cpu, UIWAIT_VECTOR);
}

>> The uintr_wait() syscall creates the very same problem as we have with
>> device interrupts. Which means we need to make that wait thing:
>>
>> upid = T1->upid;
>> upid->vector = UINTR_WAIT_VECTOR;
>
> This is exactly what I'm suggesting we *don't* do. Instead we set a
> reserved bit, we decode SENDUIPI in the #GP handler, and we emulate,
> in-kernel, the notification process for non-running tasks.

Yes, under the assumption that we can use #GP without breaking device
delivery.

> Now that I read the docs some more, I'm seriously concerned about this
> XSAVE design. XSAVES with UINTR is destructive -- it clears UINV. If
> we actually use this, then the whole last_cpu "preserve the state in
> registers" optimization goes out the window. So does anything that
> happens to assume that merely saving the state doesn't destroy it on
> respectable modern CPUs XRSTORS will #GP if you XRSTORS twice, which
> makes me nervous and would need a serious audit of our XRSTORS paths.

I have no idea what you are fantasizing about. You can XRSTORS five
times in a row as long as your XSTATE memory image is correct.

If you don't want to use XSAVES to manage UINTR then you have to manualy
fiddle with the MSRs and UIF in schedule() and return to user space.

Also keeping UINV alive when scheduling out creates a life time issue
vs. UPID:

CPU 0 CPU 1 CPU2
T1 -> schedule // UPID is live in UINTR MSRs
do_stuff_in_kernel()
local_irq_disable();
SENDUIPI(T1 -> CPU1)
pull T1
T1 exits
free UPID

local_irq_enable();
ucode handles UINV -> UAF

Clearing UINV prevents the ucode from handling the IPI and fiddling with
UPID. The CPU will forward the IPI vector to the kernel which acks it
and does nothing else, i.e. it's a spurious interrupt.

Coming back to state preserving. All what needs to be done for a
situation where the rest of the XSTATE is live in the registers, i.e.
the T -> kthread -> T scheduling scenario, is to restore UINV on exit to
user mode and handle UPID.PIR which might contain newly set bits which
are obviously not yet in UPID.IRR. That can be done by MSR fiddling or
by issuing an self IPI on the UINV vector which will be handled in ucode
on the first user space instruction after return.

When the FPU has to be restored then the state has to be updated in the
XSAVE memory image before doing XRSTORS.

Thanks,

tglx