[PATCH v3 2/3] vmstat: skip periodic vmstat update for isolated CPUs

From: Marcelo Tosatti
Date: Mon Jun 05 2023 - 15:04:12 EST


Problem: The interruption caused by vmstat_update is undesirable
for certain applications.

With workloads that are running on isolated cpus with nohz full mode to
shield off any kernel interruption. For example, a VM running a
time sensitive application with a 50us maximum acceptable interruption
(use case: soft PLC).

oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the
oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold.

The isolated vCPU can perform operations that modify per-CPU page counters,
for example to complete I/O operations:

CPU 11/KVM-9540 [001] dNh1. 2314.248584: mod_zone_page_state <-__folio_end_writeback
CPU 11/KVM-9540 [001] dNh1. 2314.248585: <stack trace>
=> 0xffffffffc042b083
=> mod_zone_page_state
=> __folio_end_writeback
=> folio_end_writeback
=> iomap_finish_ioend
=> blk_mq_end_request_batch
=> nvme_irq
=> __handle_irq_event_percpu
=> handle_irq_event
=> handle_edge_irq
=> __common_interrupt
=> common_interrupt
=> asm_common_interrupt
=> vmx_do_interrupt_nmi_irqoff
=> vmx_handle_exit_irqoff
=> vcpu_enter_guest
=> vcpu_run
=> kvm_arch_vcpu_ioctl_run
=> kvm_vcpu_ioctl
=> __x64_sys_ioctl
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe

In kernel users of vmstat counters either require the precise value and
they are using zone_page_state_snapshot interface or they can live with
an imprecision as the regular flushing can happen at arbitrary time and
cumulative error can grow (see calculate_normal_threshold).

>From that POV the regular flushing can be postponed for CPUs that have
been isolated from the kernel interference without critical
infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
for all isolated CPUs to avoid interference with the isolated workload.

Suggested by Michal Hocko.

Acked-by: Michal Hocko <mhocko@xxxxxxxx>
Signed-off-by: Marcelo Tosatti <mtosatti@xxxxxxxxxx>

---

v3: improve changelog (Michal Hocko)
v2: use cpu_is_isolated (Michal Hocko)

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -28,6 +28,7 @@
#include <linux/mm_inline.h>
#include <linux/page_ext.h>
#include <linux/page_owner.h>
+#include <linux/sched/isolation.h>

#include "internal.h"

@@ -2022,6 +2023,20 @@ static void vmstat_shepherd(struct work_
for_each_online_cpu(cpu) {
struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

+ /*
+ * In kernel users of vmstat counters either require the precise value and
+ * they are using zone_page_state_snapshot interface or they can live with
+ * an imprecision as the regular flushing can happen at arbitrary time and
+ * cumulative error can grow (see calculate_normal_threshold).
+ *
+ * From that POV the regular flushing can be postponed for CPUs that have
+ * been isolated from the kernel interference without critical
+ * infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
+ * for all isolated CPUs to avoid interference with the isolated workload.
+ */
+ if (cpu_is_isolated(cpu))
+ continue;
+
if (!delayed_work_pending(dw) && need_update(cpu))
queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);