Re: [RFC] High availability in KVM

From: Takuya Yoshikawa
Date: Mon Jun 21 2010 - 21:39:15 EST

Next message: Andrew Morton: "Re: [PATCH] fs: limit maximum concurrent coredumps"
Previous message: Stewart Smith: "Re: [PATCH][RFC] Complex filesystem operations: split and join"
In reply to: Luiz Capitulino: "Re: [RFC] High availability in KVM"

(2010/06/21 23:19), Luiz Capitulino wrote:

On a different note, in a HA environment the qemu policy described
above is not adequate; when a notification of a hardware error that
our policy determines to be serious arrives the first thing we want
to do is to put the virtual machine in a quiesced state to avoid
further wreckage. If we injected the error into the guest we would
risk a guest panic that might be detectable only by polling or, worse,
the guest being killed by the kernel, which would make postmortem
analysis of the guest impossible. Once we had the guest in a quiesced
state, where all the buffers have been flushed and the hardware
resources released, we would have two modes of operation that can be
used together and complement each other.

- Proactive: A qmp event describing the error (severity, topology,
etc) is emitted. The HA software would have to register to
receive hardware error events, possibly using the libvirt
bindings. Upon receiving the event the HA software would know
that the guest is in a failover-safe quiesced state so it could
do without fencing and proceed to the failover stage directly.

This seems to match the BLOCK_IO_ERROR event we have today: when a disk error
happens, an event is emitted and the virtual machine can be automatically
stopped (there's a configuration option for this).
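
As a rough sketch of how an HA client might consume such an event: the
function name and the decision rule below are assumptions made for
illustration, not part of qemu; only the general shape of a QMP
BLOCK_IO_ERROR message (an "event" name plus a "data" object carrying
"device", "operation" and "action") is taken from QMP itself.

```python
import json

def should_start_failover(qmp_line):
    """Return True if this QMP message is a BLOCK_IO_ERROR that stopped the guest.

    Hypothetical helper: the policy (fail over only when qemu reports
    that it stopped the VM) is an assumption for illustration.
    """
    msg = json.loads(qmp_line)
    if msg.get("event") != "BLOCK_IO_ERROR":
        return False
    # "action" reports what qemu did in response to the error:
    # "ignore", "report" or "stop".
    return msg.get("data", {}).get("action") == "stop"

# Sample message; the device name is made up.
sample = (
    '{"event": "BLOCK_IO_ERROR", '
    '"data": {"device": "ide0-hd0", "operation": "write", "action": "stop"}, '
    '"timestamp": {"seconds": 1277170755, "microseconds": 0}}'
)
```

A hardware error event along the lines of the proposal could presumably be
handled by the same kind of check on a different event name.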

On the other hand, there are a number of ways to do this differently. I think
the first thing to do is to agree on what qemu's behavior is going to be; then
we can decide how to expose this info to qmp clients.

I would like to handle qemu/KVM bugs in the same framework, too.

Although there are several ways to debug them, the easiest and most reliable one
would be to use the frozen state of the guest at the moment the bug happened.

We've already experienced some qemu crashes in our test environment which seemed
to be caused by a KVM emulation failure. Although we could guess what happened
by checking some messages like the exit reason, the guest state would have been
more helpful.

So what I want to get is:

- a new qemu/KVM mode in which guests are automatically stopped in a
failover-safe state if qemu/KVM becomes unable to continue, and

- a new interface between qemu and HA to handle the failover-safe state.

Although I personally don't mind whether the interface is event based or polling
based, one important problem from HA's point of view would be:

* how to treat errors that can originate in different layers uniformly.

E.g. if the problem is caused by the guest side, qemu may exit normally without
sending any events to HA. So an interface for polling may be helpful even if we
choose the event-driven one.
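
A minimal sketch of combining the two sources on the HA side, under stated
assumptions: the event name FAILOVER_SAFE_STOP, the function name and the
returned reason strings are all hypothetical, and how the liveness poll is
performed is left open.

```python
def detect_failure(events, qemu_alive):
    """Merge event and polling information into one verdict.

    events     -- names of events received since the last check
    qemu_alive -- result of a periodic liveness poll of the qemu process
    Returns a reason string, or None if nothing looks wrong.
    """
    # Hypothetical event for the proposed failover-safe stop.
    if "FAILOVER_SAFE_STOP" in events:
        return "failover-safe"
    # qemu is gone; if it exited without emitting any event,
    # only the poll can reveal this.
    if not qemu_alive:
        return "qemu-exited"
    return None
```

The point of the sketch is only that both inputs end up in one decision, so a
failure is reported the same way regardless of the layer it came from.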

Takuya

- Passive: Polling resource agents that need to check the state of
the guest generally use libvirt or a wrapper such as virsh. When
the state is SHUTOFF or CRASHED the resource agent proceeds to
the fencing stage, which might be expensive and usually involves
killing the qemu process. We propose adding a new state that
indicates the failover-safe state described before. In this
state the HA software would not need to use fencing techniques
and, since the qemu process is not killed, postmortem analysis of
the virtual machine is still possible.
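
The decision a polling resource agent would make under this proposal could be
sketched as follows. The state name FAILOVER_SAFE and the action strings are
assumptions for illustration; SHUTOFF and CRASHED are the states named above.

```python
def next_action(state):
    """Map a polled guest state to the resource agent's next step."""
    if state in ("SHUTOFF", "CRASHED"):
        # Guest state is unknown or lost: fence before failing over.
        return "fence"
    if state == "FAILOVER_SAFE":
        # Hypothetical new state: the guest is quiesced, so fencing can be
        # skipped and the qemu process is left alive for postmortem analysis.
        return "failover"
    # Anything else: keep monitoring.
    return "monitor"
```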

It wouldn't be polling, I guess. We already have events for most state changes.
So, when the machine stops, reboots, etc., the client would be notified and
could then inspect the virtual machine by using query commands.

This method would be preferable in case we also want this information available
in the user Monitor and/or if the event gets too messy because of the amount of
information we want to put in it.
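
A small sketch of this notify-then-query pattern, assuming a client that reads
QMP messages line by line. The function name and the choice of which events
trigger a query are assumptions; STOP, SHUTDOWN and RESET are existing QMP
events and query-status is an existing QMP command.

```python
import json

# Events after which the client wants to inspect the guest (an assumption).
STATE_CHANGE_EVENTS = ("STOP", "SHUTDOWN", "RESET")

def commands_for(event_line):
    """Return the QMP commands to send in response to one received message."""
    msg = json.loads(event_line)
    if msg.get("event") in STATE_CHANGE_EVENTS:
        # The event stays small; details are fetched with a query command.
        return [json.dumps({"execute": "query-status"})]
    return []
```

Keeping the event minimal and moving the details into query commands also
makes the same information reachable from the user Monitor.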

