Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects

From: Avi Kivity
Date: Wed Aug 19 2009 - 03:11:46 EST


On 08/19/2009 09:28 AM, Gregory Haskins wrote:
Avi Kivity wrote:
On 08/18/2009 05:46 PM, Gregory Haskins wrote:
Can you explain how vbus achieves RDMA?

I also don't see the connection to real time guests.

Both of these are still in development. Trying to stay true to the
"release early and often" mantra, the core vbus technology is being
pushed now so it can be reviewed. Stay tuned for these other
developments.

Hopefully you can outline how it works. AFAICT, RDMA and kernel bypass
will need device assignment. If you're bypassing the call into the host
kernel, it doesn't really matter how that call is made, does it?
This is for things like the setup of queue-pairs, and the transport of
doorbells and IB verbs. I am not on the team doing that work, so I am
not an expert in this area. What I do know is that having a flexible,
low-latency signal path was deemed a key requirement.

That's not a full bypass, then. AFAIK kernel bypass has userspace talking directly to the device.

Given that both virtio and vbus can use ioeventfds, I don't see how one can perform better than the other.
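
For reference, both stacks would register the kick path with KVM the same
way; a minimal userspace sketch might look like the following (vm_fd,
doorbell_gpa and register_doorbell are placeholders for this example, not
code taken from either stack):

#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/* Ask KVM to signal an eventfd whenever the guest writes 4 bytes to the
 * doorbell address, so the host backend gets the kick without a trip
 * through userspace. */
static int register_doorbell(int vm_fd, __u64 doorbell_gpa)
{
    struct kvm_ioeventfd args = {
        .addr  = doorbell_gpa,   /* guest-physical doorbell address */
        .len   = 4,              /* match 32-bit writes */
        .flags = 0,              /* MMIO, no datamatch */
    };
    int efd = eventfd(0, 0);

    if (efd < 0)
        return -1;
    args.fd = efd;
    if (ioctl(vm_fd, KVM_IOEVENTFD, &args) < 0) {
        close(efd);
        return -1;
    }
    return efd;                  /* host backend waits on this fd */
}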

For real-time, a big part of it is relaying the guest scheduler state to
the host, but in a smart way. For instance, the cpu priority for each
vcpu is in a shared-table. When the priority is raised, we can simply
update the table without taking a VMEXIT. When it is lowered, we need
to inform the host of the change in case the underlying task needs to
reschedule.
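
A hypothetical sketch of that lazy-update idea, just to make the shape
concrete (the structure, the hypercall number and the "larger value =
higher priority" convention are all assumptions for illustration, not the
actual vbus interface):

#include <linux/types.h>
#include <asm/kvm_para.h>

#define HC_PRIO_LOWERED 42          /* hypothetical hypercall number */

struct shared_prio {
    u32 prio;                       /* current vcpu priority, guest-written */
    u32 host_pending;               /* host sets this if it deferred a reschedule */
};

static void vcpu_set_prio(struct shared_prio *sp, u32 new_prio)
{
    u32 old = READ_ONCE(sp->prio);

    WRITE_ONCE(sp->prio, new_prio); /* visible to the host with no exit */

    /* Raising the priority never lets a host task preempt us, so it is
     * exit-free.  Lowering it might, so only that direction traps. */
    if (new_prio < old || READ_ONCE(sp->host_pending))
        kvm_hypercall1(HC_PRIO_LOWERED, new_prio);
}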

This is best done using cr8/tpr so you don't have to exit at all. See also my vtpr support for Windows which does this in software, generally avoiding the exit even when lowering priority.

This is where the really fast call() type mechanism is important.

It's also about having the priority flow end-to-end, and having the vcpu
interrupt state affect the task priority, etc. (e.g. pending interrupts
affect the vcpu task prio).

etc, etc.

I can go on and on (as you know ;), but will wait till this work is more
concrete and proven.

Generally cpu state shouldn't flow through a device but rather through MSRs, hypercalls, and cpu registers.
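
As a sketch of the "cpu registers" path (simplified, guest side, x86-64
only): CR8 holds the task-priority class, and with hardware TPR shadowing
most writes to it never exit; the hypervisor is only notified when the
value drops below its configured threshold.

static inline void guest_set_prio_class(unsigned long prio_class)
{
    /* CR8 takes a 4-bit priority class (0-15); with a TPR shadow this
     * normally completes without a vmexit. */
    asm volatile("mov %0, %%cr8" : : "r" (prio_class & 0xfUL) : "memory");
}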

Basically, what it comes down to is that both vbus and vhost need
configuration/management. Vbus does it with sysfs/configfs, and vhost
does it with ioctls. I ultimately decided to go with sysfs/configfs
because, at least at the time I looked, it seemed like the "blessed"
way to do user->kernel interfaces.
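
For readers unfamiliar with the configfs idiom: creating a directory
instantiates a kernel object, and writing to its attribute files
configures it. A userspace sketch, with purely hypothetical paths and
attribute names (not the actual vbus layout):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int create_venet(const char *container)
{
    char path[256];
    int fd;

    /* mkdir in configfs asks the kernel to create the device object */
    snprintf(path, sizeof(path),
             "/sys/kernel/config/vbus/%s/devices/venet0", container);
    if (mkdir(path, 0755) < 0 && errno != EEXIST)
        return -1;

    /* attributes are plain files; writing to them configures the object */
    snprintf(path, sizeof(path),
             "/sys/kernel/config/vbus/%s/devices/venet0/enabled", container);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    if (write(fd, "1", 1) != 1) {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}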

I really dislike that trend but that's an unrelated discussion.

They need to be connected to the real world somehow. What about
security? Can any user create a container and devices and link them to
real interfaces? If not, do you need to run the VM as root?
Today it has to be root, as a result of configfs's weak support for
permission modes, so you have me there. I am looking for help patching
this limitation, though.


Well, do you plan to address this before submission for inclusion?

I hope everyone agrees that it's an important issue for me and that I
have to consider non-Linux guests. I also hope that you're considering
non-Linux guests since they have considerable market share.
I didn't mean non-Linux guests are not important. I was disagreeing
with your assertion that it only works if it's PCI. There are numerous
examples of IHV/ISV "bridge" implementations deployed in Windows, no?

I don't know.

If vbus is exposed as a PCI-BRIDGE, how is this different?

Technically it would work, but given you're not interested in Windows, who would write a driver?

Given I'm not the gateway to inclusion of vbus/venet, you don't need to
ask me anything. I'm still free to give my opinion.
Agreed, and I didn't mean to suggest otherwise. It's not clear whether
you are wearing the "kvm maintainer" hat or the "lkml community member"
hat at times, so it's important to make that distinction. Otherwise, it's
not clear whether this is an edict from my superior, or input from my peer. ;)

When I wear a hat, it is a Red Hat. However I am bareheaded most often.

(that is, look at the contents of my message, not who wrote it or his role).

With virtio, the number is 1 (or less if you amortize). Set up the ring
entries and kick.
Again, I am just talking about basic PCI here, not the things we build
on top.

Whatever that means, it isn't interesting. Performance is measured for the whole stack.

The point is: the things we build on top have costs associated with
them, and I aim to minimize it. For instance, to do a "call()" kind of
interface, you generally need to pre-setup some per-cpu mappings so that
you can just do a single iowrite32() to kick the call off. Those
per-cpu mappings have a cost if you want them to be high-performance, so
my argument is that you ideally want to limit the number of times you
have to do this. My current design reduces this to "once".
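
A hypothetical guest-side sketch of that pattern, i.e. pay the mapping
cost once at probe time and keep the fast path to a single write (the
register layout, names, and per-cpu doorbell spacing are illustrative):

#include <linux/errno.h>
#include <linux/io.h>
#include <linux/percpu.h>
#include <linux/types.h>

struct call_channel {
    void __iomem *doorbell;     /* pre-mapped per-cpu doorbell register */
    u32 token;                  /* identifies this cpu/channel to the host */
};

static DEFINE_PER_CPU(struct call_channel, call_chan);

/* Paid once per cpu, at bridge/device probe time. */
static int call_channel_setup(int cpu, phys_addr_t base)
{
    struct call_channel *cc = &per_cpu(call_chan, cpu);

    cc->doorbell = ioremap(base + cpu * 4, 4);
    if (!cc->doorbell)
        return -ENOMEM;
    cc->token = cpu;
    return 0;
}

/* The fast path: one programmed-I/O write (caller has preemption disabled). */
static void call_kick(void)
{
    struct call_channel *cc = this_cpu_ptr(&call_chan);

    iowrite32(cc->token, cc->doorbell);
}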

Do you mean minimizing the setup cost? Seriously?

There's no such thing as raw PCI. Every PCI device has a protocol. The
protocol virtio chose is optimized for virtualization.
And it's a question of how that protocol scales, more than how the
protocol works.

Obviously the general idea of the protocol works, as vbus itself is
implemented as a PCI-BRIDGE and is therefore limited to the underlying
characteristics that I can get out of PCI (like PIO latency).

I thought we agreed that was insignificant?

As I've mentioned before, prioritization is available on x86
But as I've mentioned, it doesn't work very well.

I guess it isn't that important then. I note that clever prioritization in a guest is pointless if you can't do the same prioritization in the host.

, and coalescing scales badly.
Depends on what is scaling. Scaling vcpus? Yes, you are right.
Scaling the number of devices? No, this is where it improves.

If you queue pending messages instead of walking the device list, you may be right. Still, if hard interrupt processing takes 10% of your time you'll only have coalesced 10% of interrupts on average.

irq window exits ought to be pretty rare, so we're only left with
injection vmexits. At around 1us/vmexit, even 100,000 interrupts/vcpu
per second (which is excessive) will only cost you 10% cpu time.
1us is too much for what I am building, IMHO.

You can't use current hardware then.

You're free to demultiplex an MSI to however many consumers you want,
there's no need for a new bus for that.
Hmmm...can you elaborate?

Point all those MSIs at one vector. Its handler will have to poll all the attached devices though.
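
A sketch of that demultiplexing, with the consumer list and callbacks
made up for illustration; the point is that every consumer gets polled on
each interrupt:

#include <linux/interrupt.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct demux_consumer {
    struct list_head node;
    bool (*pending)(void *priv);    /* did this device raise the event? */
    void (*handle)(void *priv);
    void *priv;
};

static LIST_HEAD(consumers);
static DEFINE_SPINLOCK(consumers_lock);

/* Attached via request_irq() to the single vector all the MSIs target. */
static irqreturn_t shared_vector_irq(int irq, void *dev_id)
{
    struct demux_consumer *c;
    irqreturn_t ret = IRQ_NONE;

    spin_lock(&consumers_lock);
    list_for_each_entry(c, &consumers, node) {
        if (c->pending(c->priv)) {  /* the cost: poll everyone */
            c->handle(c->priv);
            ret = IRQ_HANDLED;
        }
    }
    spin_unlock(&consumers_lock);
    return ret;
}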

Do you use DNS? We use the PCI-SIG. If Novell is a PCI-SIG member you can
get a vendor ID and control your own virtio space.
Yeah, we have our own ID. I am more concerned about making this design
make sense outside of PCI-oriented environments.

IIRC we reuse the PCI IDs for non-PCI.

That's a bug, not a feature. It means poor scaling as the number of
vcpus increases and as the number of devices increases.
Vcpu count increasing, I agree (and am OK with, as I expect low-vcpu-count
machines to be typical).

I'm not okay with it. If you wish people to adopt vbus over virtio you'll have to address all concerns, not just yours.

Number of devices, I disagree. Can you elaborate?

With message queueing, I retract my remark.

Windows,
Work in progress.

Interesting. Do you plan to open source the code? If not, will the binaries be freely available?

large guests
Can you elaborate? I am not familiar with the term.

Many vcpus.

and multiqueue out of your design.
AFAICT, multiqueue should work quite nicely with vbus. Can you
elaborate on where you see the problem?

You said previously that you aren't interested in it, IIRC.

x86 APIC is priority aware.

Have you ever tried to use it?

I haven't, but Windows does.
Yeah, it doesn't really work well. It's an extremely rigid model that
(IIRC) only lets you prioritize in 16 groups, spaced by IDT vector (0-15 are
one level, 16-31 are another, etc.). Most of the embedded PICs I have worked
with supported direct remapping, etc. But in any case, Linux doesn't
support it, so we are hosed no matter how good it is.
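
For the record, the rigidity being described is roughly this (simplified;
it ignores the in-service register):

static inline unsigned int apic_priority_class(unsigned int vector)
{
    return vector >> 4;     /* vectors 0-15 -> class 0, 16-31 -> class 1, ... */
}

static inline int apic_would_deliver(unsigned int vector, unsigned int tpr)
{
    /* only whole 16-vector classes can be masked via the TPR */
    return apic_priority_class(vector) > (tpr >> 4);
}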

I agree that it isn't very clever (not that I am a real-time expert), but I disagree about dismissing Linux support so easily. If prioritization is such a win, it should be a win on the host as well, and we should make it work there. Further, I don't see how priorities in the guest can work if they don't on the host.


They had to build connectors just like you propose to do.
More importantly, they had to build back-end busses too, no?

They had to write 414 lines in drivers/s390/kvm/kvm_virtio.c and something similar for lguest.

But you still need vbus-connector-lguest and vbus-connector-s390 because
they all talk to the host differently. So what's changed? The names?
The fact that they don't need to redo most of the in-kernel backend
stuff. Just the connector.

So they save 414 lines but have to write a connector which is... how large?

Well, venet doesn't complement virtio-net, and virtio-pci doesn't
complement vbus-connector.
Agreed, but virtio complements vbus by virtue of virtio-vbus.

I don't see what vbus adds to virtio-net.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
