RE: [PATCH v2 15/15] Drivers: hv: Add modules to expose /dev/mshv to VMMs running on Hyper-V

From: Saurabh Singh Sengar
Date: Fri Aug 18 2023 - 09:10:01 EST




> -----Original Message-----
> From: Nuno Das Neves <nunodasneves@xxxxxxxxxxxxxxxxxxx>
> Sent: Friday, August 18, 2023 3:32 AM
> To: linux-hyperv@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> x86@xxxxxxxxxx; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; linux-
> arch@xxxxxxxxxxxxxxx
> Cc: patches@xxxxxxxxxxxxxxx; Michael Kelley (LINUX)
> <mikelley@xxxxxxxxxxxxx>; KY Srinivasan <kys@xxxxxxxxxxxxx>;
> wei.liu@xxxxxxxxxx; Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>; Dexuan Cui
> <decui@xxxxxxxxxxxxx>; apais@xxxxxxxxxxxxxxxxxxx; Tianyu Lan
> <Tianyu.Lan@xxxxxxxxxxxxx>; ssengar@xxxxxxxxxxxxxxxxxxx; MUKESH
> RATHOR <mukeshrathor@xxxxxxxxxxxxx>; stanislav.kinsburskiy@xxxxxxxxx;
> jinankjain@xxxxxxxxxxxxxxxxxxx; vkuznets <vkuznets@xxxxxxxxxx>;
> tglx@xxxxxxxxxxxxx; mingo@xxxxxxxxxx; bp@xxxxxxxxx;
> dave.hansen@xxxxxxxxxxxxxxx; hpa@xxxxxxxxx; will@xxxxxxxxxx;
> catalin.marinas@xxxxxxx
> Subject: [PATCH v2 15/15] Drivers: hv: Add modules to expose /dev/mshv to
> VMMs running on Hyper-V
>
> Add mshv, mshv_root, and mshv_vtl modules:
>
> Module mshv is the parent module to the other two. It provides /dev/mshv,
> plus some common hypercall helper code. When one of the child modules is
> loaded, it is registered with the mshv module, which then provides entry
> point(s) to the child module via the IOCTLs defined in uapi/linux/mshv.h.
>
> E.g. When the mshv_root module is loaded, it registers itself, and the
> MSHV_CREATE_PARTITION IOCTL becomes available in /dev/mshv. That is used
> to get a partition fd managed by mshv_root.
>
> Similarly for mshv_vtl module, there is MSHV_CREATE_VTL, which creates
> an fd representing the lower vtl, managed by mshv_vtl.
>
> Module mshv_root provides APIs for creating and managing child partitions.
> It defines abstractions for partitions (vms), vps (vcpus), and other things
> related to running a guest. It exposes the userspace interfaces for a VMM to
> manage the guest.
>
> Module mshv_vtl provides VTL (Virtual Trust Level) support for VMMs. In
> this scenario, the host kernel and VMM run in a higher trust level than the
> guest, but within the same partition. This provides better isolation and
> performance.
>
> Signed-off-by: Nuno Das Neves <nunodasneves@xxxxxxxxxxxxxxxxxxx>
> ---
> drivers/hv/Kconfig | 50 +
> drivers/hv/Makefile | 20 +
> drivers/hv/hv_call.c | 119 ++
> drivers/hv/hv_common.c | 4 +
> drivers/hv/mshv.h | 156 +++
> drivers/hv/mshv_eventfd.c | 758 ++++++++++++
> drivers/hv/mshv_eventfd.h | 80 ++
> drivers/hv/mshv_main.c | 208 ++++
> drivers/hv/mshv_msi.c | 129 +++
> drivers/hv/mshv_portid_table.c | 84 ++
> drivers/hv/mshv_root.h | 194 ++++
> drivers/hv/mshv_root_hv_call.c | 1064 +++++++++++++++++
> drivers/hv/mshv_root_main.c | 1964 ++++++++++++++++++++++++++++++++
> drivers/hv/mshv_synic.c | 689 +++++++++++
> drivers/hv/mshv_vtl.h | 52 +
> drivers/hv/mshv_vtl_main.c | 1542 +++++++++++++++++++++++++
> drivers/hv/xfer_to_guest.c | 28 +
> include/uapi/linux/mshv.h | 298 +++++
> 18 files changed, 7439 insertions(+)
> create mode 100644 drivers/hv/hv_call.c
> create mode 100644 drivers/hv/mshv.h
> create mode 100644 drivers/hv/mshv_eventfd.c
> create mode 100644 drivers/hv/mshv_eventfd.h
> create mode 100644 drivers/hv/mshv_main.c
> create mode 100644 drivers/hv/mshv_msi.c
> create mode 100644 drivers/hv/mshv_portid_table.c
> create mode 100644 drivers/hv/mshv_root.h
> create mode 100644 drivers/hv/mshv_root_hv_call.c
> create mode 100644 drivers/hv/mshv_root_main.c
> create mode 100644 drivers/hv/mshv_synic.c
> create mode 100644 drivers/hv/mshv_vtl.h
> create mode 100644 drivers/hv/mshv_vtl_main.c
> create mode 100644 drivers/hv/xfer_to_guest.c
> create mode 100644 include/uapi/linux/mshv.h
>
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 00242107d62e..0d9aefc07b15 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -54,4 +54,54 @@ config HYPERV_BALLOON
> help
> Select this option to enable Hyper-V Balloon driver.
>
> +config MSHV
> + tristate "Microsoft Hypervisor root partition interfaces: /dev/mshv"
> + depends on X86_64 && HYPERV
> + select EVENTFD
> + select MSHV_XFER_TO_GUEST_WORK
> + help
> + Select this option to enable core functionality for managing guest
> + virtual machines running under the Microsoft Hypervisor.
> +
> + The interfaces are provided via a device named /dev/mshv.
> +
> + To compile this as a module, choose M here.
> +
> + If unsure, say N.
> +
> +config MSHV_ROOT
> + tristate "Microsoft Hyper-V root partition APIs driver"
> + depends on MSHV
> + help
> + Select this option to provide /dev/mshv interfaces specific to
> + running as the root partition on Microsoft Hypervisor.
> +
> + To compile this as a module, choose M here.
> +
> + If unsure, say N.
> +
> +config MSHV_VTL
> + tristate "Microsoft Hyper-V VTL driver"
> + depends on MSHV
> + select HYPERV_VTL_MODE
> + select TRANSPARENT_HUGEPAGE

TRANSPARENT_HUGEPAGE can be avoided for now.

> + help
> + Select this option to enable Hyper-V VTL driver.
> + Virtual Secure Mode (VSM) is a set of hypervisor capabilities and
> + enlightenments offered to host and guest partitions which enables
> + the creation and management of new security boundaries within
> + operating system software.
> +
> + VSM achieves and maintains isolation through Virtual Trust Levels
> + (VTLs). Virtual Trust Levels are hierarchical, with higher levels
> + being more privileged than lower levels. VTL0 is the least privileged
> + level, and currently the only other level supported is VTL2.
> +
> + To compile this as a module, choose M here.
> +
> + If unsure, say N.
> +
> +config MSHV_XFER_TO_GUEST_WORK
> + bool
> +
> endmenu
> diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
> index d76df5c8c2a9..da7aa7542b05 100644
> --- a/drivers/hv/Makefile
> +++ b/drivers/hv/Makefile
> @@ -2,10 +2,30 @@
> obj-$(CONFIG_HYPERV) += hv_vmbus.o
> obj-$(CONFIG_HYPERV_UTILS) += hv_utils.o
> obj-$(CONFIG_HYPERV_BALLOON) += hv_balloon.o
> +obj-$(CONFIG_MSHV) += mshv.o
> +obj-$(CONFIG_MSHV_VTL) += mshv_vtl.o
> +obj-$(CONFIG_MSHV_ROOT) += mshv_root.o
>
> CFLAGS_hv_trace.o = -I$(src)
> CFLAGS_hv_balloon.o = -I$(src)
>
> +CFLAGS_mshv_main.o = -DHV_HYPERV_DEFS
> +CFLAGS_hv_call.o = -DHV_HYPERV_DEFS
> +CFLAGS_mshv_root_main.o = -DHV_HYPERV_DEFS
> +CFLAGS_mshv_root_hv_call.o = -DHV_HYPERV_DEFS
> +CFLAGS_mshv_synic.o = -DHV_HYPERV_DEFS
> +CFLAGS_mshv_portid_table.o = -DHV_HYPERV_DEFS
> +CFLAGS_mshv_eventfd.o = -DHV_HYPERV_DEFS
> +CFLAGS_mshv_msi.o = -DHV_HYPERV_DEFS
> +CFLAGS_mshv_vtl_main.o = -DHV_HYPERV_DEFS
> +
> +mshv-y += mshv_main.o
> +mshv_root-y := mshv_root_main.o mshv_synic.o mshv_portid_table.o \
> + mshv_eventfd.o mshv_msi.o mshv_root_hv_call.o hv_call.o
> +mshv_vtl-y := mshv_vtl_main.o hv_call.o
> +
> +obj-$(CONFIG_MSHV_XFER_TO_GUEST_WORK) += xfer_to_guest.o
> +
> hv_vmbus-y := vmbus_drv.o \
> hv.o connection.o channel.o \
> channel_mgmt.o ring_buffer.o hv_trace.o
> diff --git a/drivers/hv/hv_call.c b/drivers/hv/hv_call.c
> new file mode 100644
> index 000000000000..4455001d8545
> --- /dev/null
> +++ b/drivers/hv/hv_call.c
> @@ -0,0 +1,119 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, Microsoft Corporation.
> + *
> + * Hypercall helper functions shared between mshv modules.
> + *
> + * Authors:
> + * Nuno Das Neves <nunodasneves@xxxxxxxxxxxxxxxxxxx>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <asm/mshyperv.h>
> +
> +#define HV_GET_REGISTER_BATCH_SIZE \
> + (HV_HYP_PAGE_SIZE / sizeof(union hv_register_value))
> +#define HV_SET_REGISTER_BATCH_SIZE \
> + ((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_set_vp_registers)) \
> + / sizeof(struct hv_register_assoc))
> +
> +int hv_call_get_vp_registers(
> + u32 vp_index,
> + u64 partition_id,
> + u16 count,
> + union hv_input_vtl input_vtl,
> + struct hv_register_assoc *registers)
> +{
> + struct hv_input_get_vp_registers *input_page;
> + union hv_register_value *output_page;
> + u16 completed = 0;
> + unsigned long remaining = count;
> + int rep_count, i;
> + u64 status;
> + unsigned long flags;
> +
> + local_irq_save(flags);
> +
> + input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> + input_page->partition_id = partition_id;
> + input_page->vp_index = vp_index;
> + input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
> + input_page->rsvd_z8 = 0;
> + input_page->rsvd_z16 = 0;
> +
> + while (remaining) {
> + rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
> + for (i = 0; i < rep_count; ++i)
> + input_page->names[i] = registers[i].name;
> +
> + status = hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS, rep_count,
> + 0, input_page, output_page);

Is there any possibility that the count value is passed as 0 by mistake? In
that case, status will remain uninitialized.

> + if (!hv_result_success(status)) {
> + pr_err("%s: completed %li out of %u, %s\n",
> + __func__,
> + count - remaining, count,
> + hv_status_to_string(status));
> + break;
> + }
> + completed = hv_repcomp(status);
> + for (i = 0; i < completed; ++i)
> + registers[i].value = output_page[i];
> +
> + registers += completed;
> + remaining -= completed;
> + }
> + local_irq_restore(flags);
> +
> + return hv_status_to_errno(status);
> +}
> +
> +int hv_call_set_vp_registers(
> + u32 vp_index,
> + u64 partition_id,
> + u16 count,
> + union hv_input_vtl input_vtl,
> + struct hv_register_assoc *registers)
> +{
> + struct hv_input_set_vp_registers *input_page;
> + u16 completed = 0;
> + unsigned long remaining = count;
> + int rep_count;
> + u64 status;
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> + input_page->partition_id = partition_id;
> + input_page->vp_index = vp_index;
> + input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
> + input_page->rsvd_z8 = 0;
> + input_page->rsvd_z16 = 0;
> +
> + while (remaining) {
> + rep_count = min(remaining, HV_SET_REGISTER_BATCH_SIZE);
> + memcpy(input_page->elements, registers,
> + sizeof(struct hv_register_assoc) * rep_count);
> +
> + status = hv_do_rep_hypercall(HVCALL_SET_VP_REGISTERS, rep_count,
> + 0, input_page, NULL);
> + if (!hv_result_success(status)) {
> + pr_err("%s: completed %li out of %u, %s\n",
> + __func__,
> + count - remaining, count,
> + hv_status_to_string(status));
> + break;
> + }
> + completed = hv_repcomp(status);
> + registers += completed;
> + remaining -= completed;
> + }
> +
> + local_irq_restore(flags);
> +
> + return hv_status_to_errno(status);
> +}
> +
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 13f972e72375..ccd76f30a638 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -62,7 +62,11 @@ EXPORT_SYMBOL_GPL(hyperv_pcpu_output_arg);
> */
> static inline bool hv_output_arg_exists(void)
> {
> +#ifdef CONFIG_MSHV_VTL

Although today both options work together, I am wondering which is more
accurate here for scalability of VTL modules: CONFIG_HYPERV_VTL_MODE or
CONFIG_MSHV_VTL.

> + return true;
> +#else
> return hv_root_partition ? true : false;
> +#endif
> }
>
> static void hv_kmsg_dump_unregister(void);
> diff --git a/drivers/hv/mshv.h b/drivers/hv/mshv.h
> new file mode 100644
> index 000000000000..166480a73f3f
> --- /dev/null
> +++ b/drivers/hv/mshv.h
> @@ -0,0 +1,156 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2023, Microsoft Corporation.
> + */
> +
> +#ifndef _MSHV_H_
> +#define _MSHV_H_
> +
> +#include <linux/spinlock.h>
> +#include <linux/mutex.h>
> +#include <linux/semaphore.h>
> +#include <linux/sched.h>
> +#include <linux/srcu.h>
> +#include <linux/wait.h>
> +#include <uapi/linux/mshv.h>
> +
> +/*
> + * Hyper-V hypercalls
> + */
> +
> +int hv_call_withdraw_memory(u64 count, int node, u64 partition_id);
> +int hv_call_create_partition(
> + u64 flags,
> + struct hv_partition_creation_properties creation_properties,
> + union hv_partition_isolation_properties isolation_properties,
> + u64 *partition_id);
> +int hv_call_initialize_partition(u64 partition_id);
> +int hv_call_finalize_partition(u64 partition_id);
> +int hv_call_delete_partition(u64 partition_id);
> +int hv_call_map_gpa_pages(
> + u64 partition_id,
> + u64 gpa_target,
> + u64 page_count, u32 flags,
> + struct page **pages);
> +int hv_call_unmap_gpa_pages(
> + u64 partition_id,
> + u64 gpa_target,
> + u64 page_count, u32 flags);
> +int hv_call_get_vp_registers(
> + u32 vp_index,
> + u64 partition_id,
> + u16 count,
> + union hv_input_vtl input_vtl,
> + struct hv_register_assoc *registers);
> +int hv_call_get_gpa_access_states(
> + u64 partition_id,
> + u32 count,
> + u64 gpa_base_pfn,
> + u64 state_flags,
> + int *written_total,
> + union hv_gpa_page_access_state *states);
> +
> +int hv_call_set_vp_registers(
> + u32 vp_index,
> + u64 partition_id,
> + u16 count,
> + union hv_input_vtl input_vtl,
> + struct hv_register_assoc *registers);

Nit: opportunity to fix many of the checkpatch.pl warnings related to line
breaks here and in many other places.

> +int hv_call_install_intercept(u64 partition_id, u32 access_type,
> + enum hv_intercept_type intercept_type,
> + union hv_intercept_parameters intercept_parameter);
> +int hv_call_assert_virtual_interrupt(
> + u64 partition_id,
> + u32 vector,
> + u64 dest_addr,
> + union hv_interrupt_control control);
> +int hv_call_clear_virtual_interrupt(u64 partition_id);
> +
> +#ifdef HV_SUPPORTS_VP_STATE
> +int hv_call_get_vp_state(
> + u32 vp_index,
> + u64 partition_id,
> + enum hv_get_set_vp_state_type type,
> + struct hv_vp_state_data_xsave xsave,
> + /* Choose between pages and ret_output */
> + u64 page_count,
> + struct page **pages,
> + union hv_output_get_vp_state *ret_output);
> +int hv_call_set_vp_state(
> + u32 vp_index,
> + u64 partition_id,
> + enum hv_get_set_vp_state_type type,
> + struct hv_vp_state_data_xsave xsave,
> + /* Choose between pages and bytes */
> + u64 page_count,
> + struct page **pages,
> + u32 num_bytes,
> + u8 *bytes);
> +#endif
> +
> +int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
> + struct page **state_page);
> +int hv_call_unmap_vp_state_page(u64 partition_id, u32 vp_index, u32 type);
> +int hv_call_get_partition_property(
> + u64 partition_id,
> + u64 property_code,
> + u64 *property_value);
> +int hv_call_set_partition_property(
> + u64 partition_id, u64 property_code, u64 property_value,
> + void (*completion_handler)(void * /* data */, u64 * /* status */),
> + void *completion_data);
> +int hv_call_translate_virtual_address(
> + u32 vp_index,
> + u64 partition_id,
> + u64 flags,
> + u64 gva,
> + u64 *gpa,
> + union hv_translate_gva_result *result);
> +int hv_call_get_vp_cpuid_values(
> + u32 vp_index,
> + u64 partition_id,
> + union hv_get_vp_cpuid_values_flags values_flags,
> + struct hv_cpuid_leaf_info *info,
> + union hv_output_get_vp_cpuid_values *result);
> +
> +int hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
> + u64 connection_partition_id, struct hv_port_info
> *port_info,
> + u8 port_vtl, u8 min_connection_vtl, int node);
> +int hv_call_delete_port(u64 port_partition_id, union hv_port_id port_id);
> +int hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
> + u64 connection_partition_id,
> + union hv_connection_id connection_id,
> + struct hv_connection_info *connection_info,
> + u8 connection_vtl, int node);
> +int hv_call_disconnect_port(u64 connection_partition_id,
> + union hv_connection_id connection_id);
> +int hv_call_notify_port_ring_empty(u32 sint_index);
> +#ifdef HV_SUPPORTS_REGISTER_INTERCEPT
> +int hv_call_register_intercept_result(u32 vp_index,
> + u64 partition_id,
> + enum hv_intercept_type intercept_type,
> + union hv_register_intercept_result_parameters *params);
> +#endif
> +int hv_call_signal_event_direct(u32 vp_index,
> + u64 partition_id,
> + u8 vtl,
> + u8 sint,
> + u16 flag_number,
> + u8 *newly_signaled);
> +int hv_call_post_message_direct(u32 vp_index,
> + u64 partition_id,
> + u8 vtl,
> + u32 sint_index,
> + u8 *message);
> +
> +struct mshv_partition *mshv_partition_find(u64 partition_id) __must_hold(RCU);
> +
> +int mshv_xfer_to_guest_mode_handle_work(unsigned long ti_work);
> +
> +typedef long (*mshv_create_func_t)(void __user *user_arg);
> +typedef long (*mshv_check_ext_func_t)(u32 arg);
> +int mshv_setup_vtl_func(const mshv_create_func_t create_vtl,
> + const mshv_check_ext_func_t check_ext);
> +int mshv_set_create_partition_func(const mshv_create_func_t func);
> +
> +#endif /* _MSHV_H */
> diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
> new file mode 100644
> index 000000000000..ddc64fe3920e
> --- /dev/null
> +++ b/drivers/hv/mshv_eventfd.c
> @@ -0,0 +1,758 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * eventfd support for mshv
> + *
> + * Heavily inspired from KVM implementation of irqfd/ioeventfd. The basic
> + * framework code is taken from the kvm implementation.
> + *
> + * All credits to kvm developers.
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/wait.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/list.h>
> +#include <linux/workqueue.h>
> +#include <linux/eventfd.h>
> +
> +#include "mshv_eventfd.h"
> +#include "mshv.h"
> +#include "mshv_root.h"
> +
> +static struct workqueue_struct *irqfd_cleanup_wq;
> +
> +void
> +mshv_register_irq_ack_notifier(struct mshv_partition *partition,
> + struct mshv_irq_ack_notifier *mian)
> +{
> + mutex_lock(&partition->irq_lock);
> + hlist_add_head_rcu(&mian->link, &partition->irq_ack_notifier_list);
> + mutex_unlock(&partition->irq_lock);
> +}
> +
> +void
> +mshv_unregister_irq_ack_notifier(struct mshv_partition *partition,
> + struct mshv_irq_ack_notifier *mian)
> +{
> + mutex_lock(&partition->irq_lock);
> + hlist_del_init_rcu(&mian->link);
> + mutex_unlock(&partition->irq_lock);
> + synchronize_rcu();
> +}
> +
> +bool
> +mshv_notify_acked_gsi(struct mshv_partition *partition, int gsi)
> +{
> + struct mshv_irq_ack_notifier *mian;
> + bool acked = false;
> +
> + rcu_read_lock();
> + hlist_for_each_entry_rcu(mian, &partition->irq_ack_notifier_list,
> + link) {
> + if (mian->gsi == gsi) {
> + mian->irq_acked(mian);
> + acked = true;
> + }
> + }
> + rcu_read_unlock();
> +
> + return acked;
> +}
> +
> +static inline bool hv_should_clear_interrupt(enum hv_interrupt_type type)
> +{
> + return type == HV_X64_INTERRUPT_TYPE_EXTINT;
> +}
> +
> +static void
> +irqfd_resampler_ack(struct mshv_irq_ack_notifier *mian)
> +{
> + struct mshv_kernel_irqfd_resampler *resampler;
> + struct mshv_partition *partition;
> + struct mshv_kernel_irqfd *irqfd;
> + int idx;
> +
> + resampler = container_of(mian,
> + struct mshv_kernel_irqfd_resampler, notifier);
> + partition = resampler->partition;
> +
> + idx = srcu_read_lock(&partition->irq_srcu);
> +
> + hlist_for_each_entry_rcu(irqfd, &resampler->irqfds_list,
> resampler_hnode) {
> + if (hv_should_clear_interrupt(irqfd->lapic_irq.control.interrupt_type))
> + hv_call_clear_virtual_interrupt(partition->id);
> +
> + eventfd_signal(irqfd->resamplefd, 1);
> + }
> +
> + srcu_read_unlock(&partition->irq_srcu, idx);
> +}
> +
> +static void
> +irqfd_assert(struct work_struct *work)
> +{
> + struct mshv_kernel_irqfd *irqfd =
> + container_of(work, struct mshv_kernel_irqfd, assert);
> + struct mshv_lapic_irq *irq = &irqfd->lapic_irq;
> +
> + hv_call_assert_virtual_interrupt(irqfd->partition->id,
> + irq->vector, irq->apic_id,
> + irq->control);
> +}
> +
> +static void
> +irqfd_inject(struct mshv_kernel_irqfd *irqfd)
> +{
> + struct mshv_partition *partition = irqfd->partition;
> + struct mshv_lapic_irq *irq = &irqfd->lapic_irq;
> + unsigned int seq;
> + int idx;
> +
> + WARN_ON(irqfd->resampler &&
> + !irq->control.level_triggered);
> +
> + idx = srcu_read_lock(&partition->irq_srcu);
> + if (irqfd->msi_entry.gsi) {
> + if (!irqfd->msi_entry.entry_valid) {
> + pr_warn("Invalid routing info for gsi %u",
> + irqfd->msi_entry.gsi);
> + srcu_read_unlock(&partition->irq_srcu, idx);
> + return;
> + }
> +
> + do {
> + seq = read_seqcount_begin(&irqfd->msi_entry_sc);
> + } while (read_seqcount_retry(&irqfd->msi_entry_sc, seq));
> + }
> +
> + srcu_read_unlock(&partition->irq_srcu, idx);
> +
> + schedule_work(&irqfd->assert);
> +}
> +
> +static void
> +irqfd_resampler_shutdown(struct mshv_kernel_irqfd *irqfd)
> +{
> + struct mshv_kernel_irqfd_resampler *resampler = irqfd->resampler;
> + struct mshv_partition *partition = resampler->partition;
> +
> + mutex_lock(&partition->irqfds.resampler_lock);
> +
> + hlist_del_rcu(&irqfd->resampler_hnode);
> + synchronize_srcu(&partition->irq_srcu);
> +
> + if (hlist_empty(&resampler->irqfds_list)) {
> + hlist_del(&resampler->hnode);
> + mshv_unregister_irq_ack_notifier(partition, &resampler->notifier);
> + kfree(resampler);
> + }
> +
> + mutex_unlock(&partition->irqfds.resampler_lock);
> +}
> +
> +/*
> + * Race-free decouple logic (ordering is critical)
> + */
> +static void
> +irqfd_shutdown(struct work_struct *work)
> +{
> + struct mshv_kernel_irqfd *irqfd =
> + container_of(work, struct mshv_kernel_irqfd, shutdown);
> +
> + /*
> + * Synchronize with the wait-queue and unhook ourselves to prevent
> + * further events.
> + */
> + remove_wait_queue(irqfd->wqh, &irqfd->wait);
> +
> + if (irqfd->resampler) {
> + irqfd_resampler_shutdown(irqfd);
> + eventfd_ctx_put(irqfd->resamplefd);
> + }
> +
> + /*
> + * We know no new events will be scheduled at this point, so block
> + * until all previously outstanding events have completed
> + */
> + flush_work(&irqfd->assert);
> +
> + /*
> + * It is now safe to release the object's resources
> + */
> + eventfd_ctx_put(irqfd->eventfd);
> + kfree(irqfd);
> +}
> +
> +/* assumes partition->irqfds.lock is held */
> +static bool
> +irqfd_is_active(struct mshv_kernel_irqfd *irqfd)
> +{
> + return !hlist_unhashed(&irqfd->hnode);
> +}
> +
> +/*
> + * Mark the irqfd as inactive and schedule it for removal
> + *
> + * assumes partition->irqfds.lock is held
> + */
> +static void
> +irqfd_deactivate(struct mshv_kernel_irqfd *irqfd)
> +{
> + WARN_ON(!irqfd_is_active(irqfd));
> +
> + hlist_del(&irqfd->hnode);
> +
> + queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
> +}
> +
> +/*
> + * Called with wqh->lock held and interrupts disabled
> + */
> +static int
> +irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
> + int sync, void *key)
> +{
> + struct mshv_kernel_irqfd *irqfd =
> + container_of(wait, struct mshv_kernel_irqfd, wait);
> + unsigned long flags = (unsigned long)key;
> + int idx;
> + unsigned int seq;
> + struct mshv_partition *partition = irqfd->partition;
> + int ret = 0;
> +
> + if (flags & POLLIN) {
> + u64 cnt;
> +
> + eventfd_ctx_do_read(irqfd->eventfd, &cnt);
> + idx = srcu_read_lock(&partition->irq_srcu);
> + do {
> + seq = read_seqcount_begin(&irqfd->msi_entry_sc);
> + } while (read_seqcount_retry(&irqfd->msi_entry_sc, seq));
> +
> + /* An event has been signaled, inject an interrupt */
> + irqfd_inject(irqfd);
> + srcu_read_unlock(&partition->irq_srcu, idx);
> +
> + ret = 1;
> + }
> +
> + if (flags & POLLHUP) {
> + /* The eventfd is closing, detach from Partition */
> + unsigned long flags;
> +
> + spin_lock_irqsave(&partition->irqfds.lock, flags);
> +
> + /*
> + * We must check if someone deactivated the irqfd before
> + * we could acquire the irqfds.lock since the item is
> + * deactivated from the mshv side before it is unhooked from
> + * the wait-queue. If it is already deactivated, we can
> + * simply return knowing the other side will cleanup for us.
> + * We cannot race against the irqfd going away since the
> + * other side is required to acquire wqh->lock, which we hold
> + */
> + if (irqfd_is_active(irqfd))
> + irqfd_deactivate(irqfd);
> +
> + spin_unlock_irqrestore(&partition->irqfds.lock, flags);
> + }
> +
> + return ret;
> +}
> +
> +/* Must be called under irqfds.lock */
> +static void irqfd_update(struct mshv_partition *partition,
> + struct mshv_kernel_irqfd *irqfd)
> +{
> + write_seqcount_begin(&irqfd->msi_entry_sc);
> + irqfd->msi_entry = mshv_msi_map_gsi(partition, irqfd->gsi);
> + mshv_set_msi_irq(&irqfd->msi_entry, &irqfd->lapic_irq);
> + write_seqcount_end(&irqfd->msi_entry_sc);
> +}
> +
> +void mshv_irqfd_routing_update(struct mshv_partition *partition)
> +{
> + struct mshv_kernel_irqfd *irqfd;
> +
> + spin_lock_irq(&partition->irqfds.lock);
> + hlist_for_each_entry(irqfd, &partition->irqfds.items, hnode)
> + irqfd_update(partition, irqfd);
> + spin_unlock_irq(&partition->irqfds.lock);
> +}
> +
> +static void
> +irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
> + poll_table *pt)
> +{
> + struct mshv_kernel_irqfd *irqfd =
> + container_of(pt, struct mshv_kernel_irqfd, pt);
> +
> + irqfd->wqh = wqh;
> + add_wait_queue_priority(wqh, &irqfd->wait);
> +}
> +
> +static int
> +mshv_irqfd_assign(struct mshv_partition *partition,
> + struct mshv_irqfd *args)
> +{
> + struct eventfd_ctx *eventfd = NULL, *resamplefd = NULL;
> + struct mshv_kernel_irqfd *irqfd, *tmp;
> + unsigned int events;
> + struct fd f;
> + int ret;
> + int idx;
> +
> + irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> + if (!irqfd)
> + return -ENOMEM;
> +
> + irqfd->partition = partition;
> + irqfd->gsi = args->gsi;
> + INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
> + INIT_WORK(&irqfd->assert, irqfd_assert);
> + seqcount_spinlock_init(&irqfd->msi_entry_sc,
> + &partition->irqfds.lock);
> +
> + f = fdget(args->fd);
> + if (!f.file) {
> + ret = -EBADF;
> + goto out;
> + }
> +
> + eventfd = eventfd_ctx_fileget(f.file);
> + if (IS_ERR(eventfd)) {
> + ret = PTR_ERR(eventfd);
> + goto fail;
> + }
> +
> + irqfd->eventfd = eventfd;
> +
> + if (args->flags & MSHV_IRQFD_FLAG_RESAMPLE) {
> + struct mshv_kernel_irqfd_resampler *resampler;
> +
> + resamplefd = eventfd_ctx_fdget(args->resamplefd);
> + if (IS_ERR(resamplefd)) {
> + ret = PTR_ERR(resamplefd);
> + goto fail;
> + }
> +
> + irqfd->resamplefd = resamplefd;
> +
> + mutex_lock(&partition->irqfds.resampler_lock);
> +
> + hlist_for_each_entry(resampler,
> + &partition->irqfds.resampler_list, hnode) {
> + if (resampler->notifier.gsi == irqfd->gsi) {
> + irqfd->resampler = resampler;
> + break;
> + }
> + }
> +
> + if (!irqfd->resampler) {
> + resampler = kzalloc(sizeof(*resampler),
> + GFP_KERNEL_ACCOUNT);
> + if (!resampler) {
> + ret = -ENOMEM;
> + mutex_unlock(&partition->irqfds.resampler_lock);
> + goto fail;
> + }
> +
> + resampler->partition = partition;
> + INIT_HLIST_HEAD(&resampler->irqfds_list);
> + resampler->notifier.gsi = irqfd->gsi;
> + resampler->notifier.irq_acked = irqfd_resampler_ack;
> +
> + hlist_add_head(&resampler->hnode, &partition->irqfds.resampler_list);
> + mshv_register_irq_ack_notifier(partition,
> + &resampler->notifier);
> + irqfd->resampler = resampler;
> + }
> +
> + hlist_add_head_rcu(&irqfd->resampler_hnode, &irqfd->resampler->irqfds_list);
> +
> + mutex_unlock(&partition->irqfds.resampler_lock);
> + }
> +
> + /*
> + * Install our own custom wake-up handling so we are notified via
> + * a callback whenever someone signals the underlying eventfd
> + */
> + init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
> + init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
> +
> + spin_lock_irq(&partition->irqfds.lock);
> + if (args->flags & MSHV_IRQFD_FLAG_RESAMPLE &&
> + !irqfd->lapic_irq.control.level_triggered) {
> + /*
> + * Resample Fd must be for level triggered interrupt
> + * Otherwise return with failure
> + */
> + spin_unlock_irq(&partition->irqfds.lock);
> + ret = -EINVAL;
> + goto fail;
> + }
> + ret = 0;
> + hlist_for_each_entry(tmp, &partition->irqfds.items, hnode) {
> + if (irqfd->eventfd != tmp->eventfd)
> + continue;
> + /* This fd is used for another irq already. */
> + ret = -EBUSY;
> + spin_unlock_irq(&partition->irqfds.lock);
> + goto fail;
> + }
> +
> + idx = srcu_read_lock(&partition->irq_srcu);
> + irqfd_update(partition, irqfd);
> + hlist_add_head(&irqfd->hnode, &partition->irqfds.items);
> + spin_unlock_irq(&partition->irqfds.lock);
> +
> + /*
> + * Check if there was an event already pending on the eventfd
> + * before we registered, and trigger it as if we didn't miss it.
> + */
> + events = vfs_poll(f.file, &irqfd->pt);
> +
> + if (events & POLLIN)
> + irqfd_inject(irqfd);
> +
> + srcu_read_unlock(&partition->irq_srcu, idx);
> + /*
> + * do not drop the file until the irqfd is fully initialized, otherwise
> + * we might race against the POLLHUP
> + */
> + fdput(f);
> +
> + return 0;
> +
> +fail:
> + if (irqfd->resampler)
> + irqfd_resampler_shutdown(irqfd);
> +
> + if (resamplefd && !IS_ERR(resamplefd))
> + eventfd_ctx_put(resamplefd);
> +
> + if (eventfd && !IS_ERR(eventfd))
> + eventfd_ctx_put(eventfd);
> +
> + fdput(f);
> +
> +out:
> + kfree(irqfd);
> + return ret;
> +}
> +
> +/*
> + * shutdown any irqfd's that match fd+gsi
> + */
> +static int
> +mshv_irqfd_deassign(struct mshv_partition *partition,
> + struct mshv_irqfd *args)
> +{
> + struct mshv_kernel_irqfd *irqfd;
> + struct hlist_node *n;
> + struct eventfd_ctx *eventfd;
> +
> + eventfd = eventfd_ctx_fdget(args->fd);
> + if (IS_ERR(eventfd))
> + return PTR_ERR(eventfd);
> +
> + hlist_for_each_entry_safe(irqfd, n, &partition->irqfds.items, hnode) {
> + if (irqfd->eventfd == eventfd && irqfd->gsi == args->gsi)
> + irqfd_deactivate(irqfd);
> + }
> +
> + eventfd_ctx_put(eventfd);
> +
> + /*
> + * Block until we know all outstanding shutdown jobs have completed
> + * so that we guarantee there will not be any more interrupts on this
> + * gsi once this deassign function returns.
> + */
> + flush_workqueue(irqfd_cleanup_wq);
> +
> + return 0;
> +}
> +
> +int
> +mshv_irqfd(struct mshv_partition *partition, struct mshv_irqfd *args)
> +{
> + if (args->flags & MSHV_IRQFD_FLAG_DEASSIGN)
> + return mshv_irqfd_deassign(partition, args);
> +
> + return mshv_irqfd_assign(partition, args);
> +}
> +
> +/*
> + * This function is called as the mshv VM fd is being released.
> + * Shutdown all irqfds that still remain open
> + */
> +static void
> +mshv_irqfd_release(struct mshv_partition *partition)
> +{
> + struct mshv_kernel_irqfd *irqfd;
> + struct hlist_node *n;
> +
> + spin_lock_irq(&partition->irqfds.lock);
> +
> + hlist_for_each_entry_safe(irqfd, n, &partition->irqfds.items, hnode)
> + irqfd_deactivate(irqfd);
> +
> + spin_unlock_irq(&partition->irqfds.lock);
> +
> + /*
> + * Block until we know all outstanding shutdown jobs have completed
> + * since we do not take a mshv_partition* reference.
> + */
> + flush_workqueue(irqfd_cleanup_wq);
> +
> +}
> +
> +int mshv_irqfd_wq_init(void)
> +{
> + irqfd_cleanup_wq = alloc_workqueue("mshv-irqfd-cleanup", 0, 0);
> + if (!irqfd_cleanup_wq)
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> +void mshv_irqfd_wq_cleanup(void)
> +{
> + destroy_workqueue(irqfd_cleanup_wq);
> +}
> +
> +/*
> + * --------------------------------------------------------------------
> + * ioeventfd: translate a MMIO memory write to an eventfd signal.
> + *
> + * userspace can register a MMIO address with an eventfd for receiving
> + * notification when the memory has been touched.
> + *
> + * TODO: Implement eventfd for PIO as well.
> + * --------------------------------------------------------------------
> + */
> +
> +static void
> +ioeventfd_release(struct kernel_mshv_ioeventfd *p, u64 partition_id)
> +{
> + if (p->doorbell_id > 0)
> + mshv_unregister_doorbell(partition_id, p->doorbell_id);
> + eventfd_ctx_put(p->eventfd);
> + kfree(p);
> +}
> +
> +/* MMIO writes trigger an event if the addr/val match */
> +static void
> +ioeventfd_mmio_write(int doorbell_id, void *data)
> +{
> + struct mshv_partition *partition = (struct mshv_partition *)data;
> + struct kernel_mshv_ioeventfd *p;
> +
> + rcu_read_lock();
> + hlist_for_each_entry_rcu(p, &partition->ioeventfds.items, hnode) {
> + if (p->doorbell_id == doorbell_id) {
> + eventfd_signal(p->eventfd, 1);
> + break;
> + }
> + }
> + rcu_read_unlock();
> +}
> +
> +static bool
> +ioeventfd_check_collision(struct mshv_partition *partition,
> + struct kernel_mshv_ioeventfd *p)
> + __must_hold(&partition->mutex)
> +{
> + struct kernel_mshv_ioeventfd *_p;
> +
> + hlist_for_each_entry(_p, &partition->ioeventfds.items, hnode)
> + if (_p->addr == p->addr && _p->length == p->length &&
> + (_p->wildcard || p->wildcard ||
> + _p->datamatch == p->datamatch))
> + return true;
> +
> + return false;
> +}
> +
> +static int
> +mshv_assign_ioeventfd(struct mshv_partition *partition,
> + struct mshv_ioeventfd *args)
> + __must_hold(&partition->mutex)
> +{
> + struct kernel_mshv_ioeventfd *p;
> + struct eventfd_ctx *eventfd;
> + u64 doorbell_flags = 0;
> + int ret;
> +
> + /* This mutex is currently protecting ioeventfd.items list */
> + WARN_ON_ONCE(!mutex_is_locked(&partition->mutex));
> +
> + if (args->flags & MSHV_IOEVENTFD_FLAG_PIO)
> + return -EOPNOTSUPP;
> +
> + /* must be natural-word sized */
> + switch (args->len) {
> + case 0:
> + doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_ANY;
> + break;
> + case 1:
> + doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_BYTE;
> + break;
> + case 2:
> + doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_WORD;
> + break;
> + case 4:
> + doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_DWORD;
> + break;
> + case 8:
> + doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_QWORD;
> + break;
> + default:
> + pr_warn("ioeventfd: invalid length specified\n");
> + return -EINVAL;
> + }
> +
> + /* check for range overflow */
> + if (args->addr + args->len < args->addr)
> + return -EINVAL;
> +
> + /* check for extra flags that we don't understand */
> + if (args->flags & ~MSHV_IOEVENTFD_VALID_FLAG_MASK)
> + return -EINVAL;
> +
> + eventfd = eventfd_ctx_fdget(args->fd);
> + if (IS_ERR(eventfd))
> + return PTR_ERR(eventfd);
> +
> + p = kzalloc(sizeof(*p), GFP_KERNEL);
> + if (!p) {
> + ret = -ENOMEM;
> + goto fail;
> + }
> +
> + p->addr = args->addr;
> + p->length = args->len;
> + p->eventfd = eventfd;
> +
> + /* The datamatch feature is optional, otherwise this is a wildcard */
> + if (args->flags & MSHV_IOEVENTFD_FLAG_DATAMATCH)
> + p->datamatch = args->datamatch;
> + else {
> + p->wildcard = true;
> + doorbell_flags |= HV_DOORBELL_FLAG_TRIGGER_ANY_VALUE;
> + }
> +
> + if (ioeventfd_check_collision(partition, p)) {
> + ret = -EEXIST;
> + goto unlock_fail;
> + }
> +
> + ret = mshv_register_doorbell(partition->id, ioeventfd_mmio_write,
> + (void *)partition, p->addr,
> + p->datamatch, doorbell_flags);
> + if (ret < 0) {
> + pr_err("Failed to register ioeventfd doorbell!\n");

Nit: Do we want to print the function name at the start of pr_err()?

- Saurabh