Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

From: Marcelo Tosatti
Date: Sat Apr 22 2023 - 21:34:56 EST


On Wed, Apr 19, 2023 at 01:35:12PM -0300, Marcelo Tosatti wrote:
> Hi Michal,
>
> On Wed, Apr 19, 2023 at 04:35:51PM +0200, Michal Hocko wrote:
> > On Wed 19-04-23 10:48:03, Marcelo Tosatti wrote:
> > > On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote:
> > [...]
> > > > 2) Run critical code
> > > > 3) Optionally do something once you're done
> > > >
> > > > If vmstat is going to be the only thing to wait for on 1), then the remote
> > > > solution looks good enough (although I leave that to -mm guys as I'm too
> > > > clueless about those matters),
> > >
> > > I am mostly clueless too, but i don't see a problem with the proposed
> > > patch (and no one has pointed any problem either).
> >
> > I really hate to repeat myself again. The biggest pushback has been on
> > a) justification and b) single purpose solution which is very likely
> > incomplete. For a) we are getting the story piece by piece which doesn't
> > speed up the process. You are proposing a non-trivial change to an
> > already convoluted code so having a solid justification is something
> > that shouldn't be all that surprising.
>
> The justification is simple and concise:
>
> 2. With a task that busy loops on a given CPU,
> the kworker interruption to execute vmstat_update
> is undesired and may exceed latency thresholds
> for certain applications.
>
> Performance details for the kworker interruption:
>
> oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
>
> The example above shows an additional 7us for the
>
> oslat -> kworker -> oslat
>
> switches. In the case of a virtualized CPU, and the vmstat_update
> interruption in the host (of a qemu-kvm vcpu), the latency penalty
> observed in the guest is higher than 50us, violating the acceptable
> latency threshold for certain applications.
>
> ---
>
> An additional use-case is what has been noted by Andrew Theurer:
>
> Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency
> as measured by cyclictest & oslat. We cannot achieve under 20 usec with
> the vmstats interruption.
>
> ---
>
> It seems to me this is solid justification (it seems you want
> particular microsecond values, but those depend on application
> and the CPU). The point is there are several applications which do not
> want to be interrupted (we can ignore the specifics and focus on
> that fact).
>
> Moreover, unrelated interruptions might occur close in time
> (for example, random function call IPIs generated by other
> kernel subsystems), which renders the "lets just consider this
> one application, running on this CPU, to decide what to do"
> short sighted.
>
> > b) is what concerns me more though. There are other per-cpu specific
> > things going on that require some regular flushing. Just to mention
> > another one that your group has been brought up was the memcg pcp
> > caches. Again with a non-trivial proposal to deal with that problem
> > [1].
>
> Yes.
>
> > It has turned out that we can do a simpler thing [2].
>
> For the particular memcg case, there was a simpler fix.
>
> For the vmstat_update case, i don't see a simpler fix.
>
> > I do not think it is a stretch to expect that similar things will pop
> > out every now and then
>
> Agree.
>
> > and rather than dealing with each one in its own way it
> > kinda makes sense to come up with a more general concept so that all
> > those cases can be handled at a single place at least.
>
> I can understand where you are coming from. Unfortunately,
> for some cases it is appropriate to perform the work from a
> remote CPU (and i think this is one such case).
>
> > All I hear about
> > that is that the code of those special applications would need to be
> > changed to use that.
>
> This is a burden for application writers and for system configuration.
>
> Or it could be done automatically (from outside of the application).
> Which is what is described and implemented here:
>
> https://lore.kernel.org/lkml/20220204173537.429902988@fedora.localdomain/
>
> "Task isolation is divided in two main steps: configuration and
> activation.
>
> Each step can be performed by an external tool or the latency
> sensitive application itself. util-linux contains the "chisol" tool
> for this purpose."
>
> But not only that, the second thing is:
>
> "> Another important point is this: if an application dirties
> > its own per-CPU vmstat cache, while performing a system call,
>
> Or while handling a VM-exit from a vCPU.
>
> This are, in my mind, sufficient reasons to discard the "flush per-cpu
> caches" idea. This is also why i chose to abandon the prctrl interface
> patchset.

There are additional details that were not mentioned. When we think
of flushing caches, or disabling per-CPU caches, this means that the
isolated application loses the benefit of those caches (which means you
are turning a "general purpose" programming environment into
potentially slower environment for applications to execute).

(yes, of course, one has to be mindful of which system calls can be
used, for example the execution time of system calls which take locks will
depend on whether, and how many, users of those locks there are at a
given moment).

For example, if we flush the vm-stats at every system call return,
and the application happens to switch between different phases of

isolated, userspace only behaviour (computation)
system call intensive behaviour

(which is an HPC example Thomas Gleixner mentioned in a different
thread), then you see the disadvantage of the "special" environment
where flushing is performed on return to system calls.

So it seems to me (unless there are points that show otherwise, which
would indicate that explicit userspace interfaces are preferred) _not_
requiring userspace changes is a superior solution.

Perhaps the complexity should be judged for individual cases
of interruptions, and if a given interruption-free conversion
is seen as too complex, then a "disable feature which makes use of per-CPU
caches" style solution can be made (and then userspace has to
explicitly request for that per-CPU feature to be disabled).

But i don't see that this patchset introduces unmanageable complexity,
neither:

01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock