Re: [PATCH] Drivers: hv: vmbus: Introduce the CHANNELMSG_MODIFYCHANNEL_RESPONSE message type

From: Andrea Parri
Date: Mon Nov 30 2020 - 18:09:15 EST


On Sun, Nov 29, 2020 at 06:29:55PM +0000, Michael Kelley wrote:
> From: Andrea Parri (Microsoft) <parri.andrea@xxxxxxxxx> Sent: Thursday, November 26, 2020 11:12 AM
> >
> > Quoting from commit 7527810573436f ("Drivers: hv: vmbus: Introduce
> > the CHANNELMSG_MODIFYCHANNEL message type"),
> >
> > "[...] Hyper-V can *not* currently ACK CHANNELMSG_MODIFYCHANNEL
> > messages with the promise that (after the ACK is sent) the
> > channel won't send any more interrupts to the "old" CPU.
> >
> > The peculiarity of the CHANNELMSG_MODIFYCHANNEL messages is
> > problematic if the user want to take a CPU offline, since we
> > don't want to take a CPU offline (and, potentially, "lose"
> > channel interrupts on such CPU) if the host is still processing
> > a CHANNELMSG_MODIFYCHANNEL message associated to that CPU."
> >
> > Introduce the CHANNELMSG_MODIFYCHANNEL_RESPONSE(24) message type,
> > which embodies the type of the CHANNELMSG_MODIFYCHANNEL ACK.
>
> I feel like the commit message needs a bit more detail about the new
> functionality that is added, including introducing the new VMbus protocol
> version 5.3, and then waiting for the response message when running
> with that protocol version of later. Also, when taking a CPU offline, new
> functionality ensures that there are no pending channel interrupts for
> that CPU.

I'll add details along these lines in the next submission.


>
> Could/should this patch be broken into multiple patches?
>
> 1) Introduce and negotiate VMbus protocol version 5.3. Note that just
> negotiating version 5.3 should be safe because any ACK messages will
> be cleanly ignored by the old code.
> 2) Check for pending channel interrupts before taking a CPU offline.
> Wouldn't this check be valid even with the existing code where there is no
> ACK message? The old code is, in effect, assuming that enough time has
> passed such that the modify channel message has been processed.
> 3) Code to wait for an ACK message, and code to receive and process an
> ACK message.

I think that the patch could be broken into multiple patches. I'm still
wondering whether we could/should invoke hv_synic_event_pending() without
an ACK message (like the description above seems to suggest), say, from
the introduction of MODIFYCHANNEL messages... Thoughts?


> > +static int send_modifychannel_with_ack(struct vmbus_channel *channel, u32 target_vp)
> > +{
> > + struct vmbus_channel_modifychannel *msg;
> > + struct vmbus_channel_msginfo *info;
> > + unsigned long flags;
> > + int ret;
> > +
> > + info = kzalloc(sizeof(struct vmbus_channel_msginfo) +
> > + sizeof(struct vmbus_channel_modifychannel),
> > + GFP_KERNEL);
> > + if (!info)
> > + return -ENOMEM;
> > +
> > + init_completion(&info->waitevent);
> > + info->waiting_channel = channel;
> > +
> > + msg = (struct vmbus_channel_modifychannel *)info->msg;
> > + msg->header.msgtype = CHANNELMSG_MODIFYCHANNEL;
> > + msg->child_relid = channel->offermsg.child_relid;
> > + msg->target_vp = target_vp;
> > +
> > + spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
> > + list_add_tail(&info->msglistentry, &vmbus_connection.chn_msg_list);
> > + spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
> > +
> > + if (channel->rescind) {
> > + ret = -ENODEV;
> > + goto free_info;
> > + }
>
> Does the above check really do anything? Such checks are sprinkled throughout
> the VMbus code, and I've always questioned their efficacy. The rescind flag can
> be set without holding the channel_mutex, so as soon as the above code tests
> the flag, it seems like it could change.

Same question (and questioning) here actually. I'm planning to remove
this check in v2. Similarly for the ->rescind check below.


>
> > +
> > + ret = vmbus_post_msg(msg, sizeof(struct vmbus_channel_modifychannel), true);
>
> Use "sizeof(*msg)" instead?

Applied.


>
> > + trace_vmbus_send_modifychannel(msg, ret);
> > + if (ret != 0)
> > + goto clean_msglist;
> > +
> > + /*
> > + * Release channel_mutex; otherwise, vmbus_onoffer_rescind() could block on
> > + * the mutex and be unable to signal the completion.
> > + */
> > + mutex_unlock(&vmbus_connection.channel_mutex);
> > + wait_for_completion(&info->waitevent);
> > + mutex_lock(&vmbus_connection.channel_mutex);
>
> The calling function, target_cpu_store(), obtains the mutex to synchronize against
> channel closure. Does releasing the mutex here still ensure the needed synchronization?
> If so, some comments explaining why could be helpful. I didn't try to reason through it
> all, so I'm not sure.

AFAICT, the synchronization against channel closure just want the mutex
critical section to include the MODIFYCHANNEL post.

AFAICT, different target_cpu_store() invocations remain serialized (by
the mutex in kernfs_fop_write()); maybe this is worth noticing?


>
> > +
> > + spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
> > + list_del(&info->msglistentry);
> > + spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
> > +
> > + if (channel->rescind) {
> > + ret = -ENODEV;
> > + goto free_info;
> > + }
>
> Again, I'm wondering if the above is actually useful.

See above.


>
> > +
> > + if (info->response.modify_response.status) {
>
> I'm thinking that current Hyper-V never sends a non-zero status. But it's good
> to check in case of future changes. The implication is that a non-zero status means
> that the request to change the target CPU was not performed, correct?

This corresponds to my understanding too. But you're right, I should
have verified with the Hyper-V team...


>
> > + kfree(info);
> > + return -EAGAIN;
> > + }
> > +
> > + kfree(info);
> > + return 0;
> > +
> > +clean_msglist:
> > + spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
> > + list_del(&info->msglistentry);
> > + spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
>
> The error handling seems more complex than needed. The "clean_msglist" label
> is only called from one place, so the above code could go inline at that place.

Mmh, it seems also wrong... IAC, I'll review it before the next
submission, taking into account these remarks.


>
> > +free_info:
> > + kfree(info);
> > + return ret;
>
> More error handling: The kfree(info) call is done in three different places. With
> consistent use of the 'ret' local variable, only one place would be needed.

See above.


>
> > +}
> > +
> > /*
> > * Set/change the vCPU (@target_vp) the channel (@child_relid) will interrupt.
> > *
> > * CHANNELMSG_MODIFYCHANNEL messages are aynchronous. Also, Hyper-V does not
> > - * ACK such messages. IOW we can't know when the host will stop interrupting
> > - * the "old" vCPU and start interrupting the "new" vCPU for the given channel.
> > + * ACK such messages before VERSION_WIN10_V5_3. Without ACK, we can not know
> > + * when the host will stop interrupting the "old" vCPU and start interrupting
> > + * the "new" vCPU for the given channel.
>
> In the interest of clarity, make the above comment slightly more explicit: When VMbus
> version 5.3 or later is negotiated, Hyper-V always sends an ACK in response to
> CHANNELMSG_MODIFYCHANNEL. For VMbus version 5.2 and earlier, it never sends an ACK.

Applied.


> > +static void vmbus_onmodifychannel_response(struct vmbus_channel_message_header *hdr)
> > +{
> > + struct vmbus_channel_modifychannel_response *response;
> > + struct vmbus_channel_msginfo *msginfo;
> > + unsigned long flags;
> > +
> > + response = (struct vmbus_channel_modifychannel_response *)hdr;
> > +
> > + trace_vmbus_onmodifychannel_response(response);
> > +
> > + /*
> > + * Find the modify msg, copy the response and signal/unblock the wait event.
> > + */
> > + spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
> > +
> > + list_for_each_entry(msginfo, &vmbus_connection.chn_msg_list, msglistentry) {
> > + struct vmbus_channel_message_header *responseheader =
> > + (struct vmbus_channel_message_header *)msginfo->msg;
> > +
> > + if (responseheader->msgtype == CHANNELMSG_MODIFYCHANNEL) {
> > + struct vmbus_channel_modifychannel *modifymsg;
> > +
> > + modifymsg = (struct vmbus_channel_modifychannel *)msginfo->msg;
> > + if (modifymsg->child_relid == response->child_relid) {
> > + memcpy(&msginfo->response.modify_response, response,
> > + sizeof(struct vmbus_channel_modifychannel_response));
>
> Use "sizeof(*response)" ?

Applied.


> > @@ -237,6 +238,40 @@ void hv_synic_disable_regs(unsigned int cpu)
> > hv_set_synic_state(sctrl.as_uint64);
> > }
> >
> > +#define HV_MAX_TRIES 3
> > +/*
> > + * Scan the event flags page of 'this' CPU looking for any bit that is set. If we find one
> > + * bit set, then wait for a few milliseconds. Repeat these steps for a maximum of 3 times.
>
> A bit more comment here might be warranted. If a bit is set, that means there is a
> pending channel interrupt. The expectation is that the normal interrupt handling
> mechanism will find and process the channel interrupt "very soon", and in the process
> clear the bit. Since Hyper-V won't be setting any additional channel interrupt bits, we
> should very soon reach a state where no bits are set.

I could add something like the above. Some editing seems to be required
if we want it to be part of 2/3 (that is, before ACKs are introduced).
Similar consideration holds for my comment in hv_synic_cleanup() below.

Suggestions?


>
> > + * Return 'true', if there is still any set bit after this operation; 'false', otherwise.
> > + */
> > +static bool hv_synic_event_pending(void)
> > +{
> > + struct hv_per_cpu_context *hv_cpu = this_cpu_ptr(hv_context.cpu_context);
> > + union hv_synic_event_flags *event =
> > + (union hv_synic_event_flags *)hv_cpu->synic_event_page + VMBUS_MESSAGE_SINT;
> > + unsigned long *recv_int_page = event->flags;
>
> I think a comment is warranted here. The similar vmbus_chan_sched() code has two ways
> to get the recv_int_page, depending on the negotiated VMbus protocol version. This code
> assumes the "new" way to get the recv_int_page, which makes sense given that it is invoked
> only for newer VMbus protocol versions. But a note about that assumption would be a good
> warning for future readers/coders.

I've added a comment to this effect.


>
> > + bool pending;
> > + u32 relid;
> > + int tries = 0;
> > +
> > +retry:
> > + pending = false;
> > + for_each_set_bit(relid, recv_int_page, HV_EVENT_FLAGS_COUNT) {
> > + /* Special case - VMBus channel protocol messages */
> > + if (relid == 0)
> > + continue;
> > + if (sync_test_bit(relid, recv_int_page)) {
> > + pending = true;
> > + break;
>
> I'm not clear on why the above test is needed. If the bit was found to be set
> by for_each_set_bit(), that seems good enough to set pending=true.

Agreed. I've removed the test.

Thanks,
Andrea