[PATCH v3 3/3] mm/vmstat: do not refresh stats for isolated CPUs

From: Marcelo Tosatti
Date: Mon Jun 05 2023 - 15:04:06 EST


schedule_work_on API uses the workqueue mechanism to
queue a work item on a queue. A kernel thread, which
runs on the target CPU, executes those work items.

Therefore, when using the schedule_work_on API,
it is necessary for the kworker kernel thread to
be scheduled in, for the work function to be executed.

Time sensitive applications such as SoftPLCs
(https://tum-esi.github.io/publications-list/PDF/2022-ETFA-How_Real_Time_Are_Virtual_PLCs.pdf),
have their response times affected by such interruptions.

The /proc/sys/vm/stat_refresh file was originally introduced
with the goal to:

"Provide /proc/sys/vm/stat_refresh to force an immediate update of
per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
before checking counts when testing. Originally added to work around a
bug which left counts stranded indefinitely on a cpu going idle (an
inaccuracy magnified when small below-batch numbers represent "huge"
amounts of memory), but I believe that bug is now fixed: nonetheless,
this is still a useful knob."

Other than the potential interruption to a time sensitive application,
if using SCHED_FIFO or SCHED_RR priority on the isolated CPU, then
system hangs can occur:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=978688

To avoid the problems above, do not schedule the work to synchronize
per-CPU mm counters on isolated CPUs. Given the possibility for
breaking existing userspace applications, avoid returning
errors from access to /proc/sys/vm/stat_refresh.

Signed-off-by: Marcelo Tosatti <mtosatti@xxxxxxxxxx>

---
v3: improve changelog (Michal Hocko)
v2: opencode schedule_on_each_cpu (Michal Hocko)

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -1881,8 +1881,13 @@ int vmstat_refresh(struct ctl_table *tab
void *buffer, size_t *lenp, loff_t *ppos)
{
long val;
- int err;
int i;
+ int cpu;
+ struct work_struct __percpu *works;
+
+ works = alloc_percpu(struct work_struct);
+ if (!works)
+ return -ENOMEM;

/*
* The regular update, every sysctl_stat_interval, may come later
@@ -1896,9 +1901,24 @@ int vmstat_refresh(struct ctl_table *tab
* transiently negative values, report an error here if any of
* the stats is negative, so we know to go looking for imbalance.
*/
- err = schedule_on_each_cpu(refresh_vm_stats);
- if (err)
- return err;
+ cpus_read_lock();
+ for_each_online_cpu(cpu) {
+ struct work_struct *work;
+
+ if (cpu_is_isolated(cpu))
+ continue;
+ work = per_cpu_ptr(works, cpu);
+ INIT_WORK(work, refresh_vm_stats);
+ schedule_work_on(cpu, work);
+ }
+
+ for_each_online_cpu(cpu) {
+ if (cpu_is_isolated(cpu))
+ continue;
+ flush_work(per_cpu_ptr(works, cpu));
+ }
+ cpus_read_unlock();
+ free_percpu(works);
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
/*
* Skip checking stats known to go negative occasionally.