Re: [PATCH] Updates to Xen hypercall preemption

From: Juergen Gross
Date: Thu Jun 22 2023 - 13:20:36 EST


On 22.06.23 18:39, Andy Lutomirski wrote:
On Thu, Jun 22, 2023, at 3:33 AM, Juergen Gross wrote:
On 22.06.23 10:26, Peter Zijlstra wrote:
On Thu, Jun 22, 2023 at 07:22:53AM +0200, Juergen Gross wrote:

The hypercalls we are talking of are synchronous ones. They are running
in the context of the vcpu doing the call (like a syscall from userland is
running in the process context).

(so time actually passes from the guest's pov?)

Correct.


The hypervisor will return to guest context from time to time by modifying
the registers such that the guest will do the hypercall again with different
input values for the hypervisor, resulting in a proper continuation of the
hypercall processing.

Eeeuw.. that's pretty terrible. And changing this isn't in the cards,
like at all?

In the long run this should be possible, but not for already existing Xen
versions.


That is, why isn't this whole thing written like:

	for (;;) {
		ret = hypercall(foo);
		if (ret == -EAGAIN) {
			cond_resched();
			continue;
		}
		break;
	}

The hypervisor doesn't return -EAGAIN for hysterical reasons.

This would be one of the options to change the interface. OTOH there are cases
where already existing hypercalls need to be modified in the hypervisor to
preempt in the middle, e.g. for security reasons (avoiding cpu hogging in
special cases).

Additionally some of the hypercalls being subject to preemption are allowed in
unprivileged guests, too. Those are mostly hypercalls allowed for PV guests
only, but some are usable by all guests.


It is an awful interface and I agree that switching to full preemption in
dom0 seems to be the route we should try to take.

Well, I would very strongly suggest the route to take is to scrap the
whole thing and invest in doing something saner so we don't have to jump
through hoops like this.

This is quite possibly the worst possible interface for this Xen could
have come up with -- awards material for sure.

Yes.


The downside would be that some workloads might see worse performance
because backend I/O handling might get preempted.

Is that an actual concern? Mark this a legacy interface and anybody who
wants to get away from it updates.

It isn't that easy. See above.


Just thinking - can full preemption be enabled per process?

Nope, that's a system wide thing. Preemption is something that's driven
by the requirements of the tasks that preempt, not by the tasks that get
preempted.

Depends. If a task in a non-preempt system could switch itself to be
preemptable, we could do so around hypercalls without compromising the
general preemption setting. Disabling preemption in a preemptable system
should continue to be possible for short code paths only, of course.

Andy's idea of having that thing intercepted as an exception (EXTABLE
like) and relocating the IP to a place that does cond_resched() before
going back is an option.. gross, but possibly better, dunno.

Quite the mess indeed :/

Yeah.

Having one implementation of interrupt handlers that schedule when they interrupt kernel code (the normal full preempt path) is one thing. Having two of them (full preempt and super-special-Xen) is IMO quite a bit worse. Especially since no one tests the latter very well.

Having a horrible Xen-specific extable-like thingy seems honestly rather less bad. It could even have a little self-contained test that runs at boot, I bet.

But I'll bite on the performance impact issue. What, exactly, is wrong with full preemption? Full preemption has two sources of overhead, I think. One is a bit of bookkeeping. The other is the overhead inherent in actually rescheduling -- context switch cost, losing things from cache, etc.

The bookkeeping part should have quite low overhead. The scheduling part sounds like it might just need some scheduler tuning if it's really a problem.

In any case, for backend IO, full preemption sounds like it should be a win, not a loss. If I'm asking dom0 to do backend IO for me, I don't want it delayed because dom0 was busy doing something else boring. IO is faster when the latency between requesting it and actually submitting it to hardware is lower.

Maybe. I was assuming that full preemption would result in more context
switches, especially when many guests are hammering dom0 with I/Os.
More time would then be spent on switching instead of doing real work,
so dom0 would hit 100% cpu sooner while getting less done.

IMHO the reason is similar to the reason why servers tend to be run
without preemption (higher throughput at the expense of higher latency).
Full preemption is preferred for systems being used interactively, like
workstations and laptops, where latency does matter, as long as the
system isn't cpu-bound most of the time.

I'm pretty sure Xen installations like in QubesOS will prefer to run the
guests fully preemptive for that very reason.

Can anyone actually demonstrate full preemption being a loss on a real Xen PV workload?

Should be doable, but I think the above reasoning already points in the
right direction.


Juergen
