Re: [RFC] cputime: Introduce option to force full dynticks accounting on NOHZ & NOHZ_IDLE CPUs

From: Nicolas Saenz Julienne
Date: Tue Feb 20 2024 - 13:20:15 EST


Hi Sean,

On Tue Feb 20, 2024 at 4:18 PM UTC, Sean Christopherson wrote:
> On Mon, Feb 19, 2024, Nicolas Saenz Julienne wrote:
> > Under certain extreme conditions, the tick-based cputime accounting may
> > produce inaccurate data. For instance, guest CPU usage is sensitive to
> > interrupts firing right before the tick's expiration. This forces the
> > guest into kernel context, and has that time slice wrongly accounted as
> > system time. This issue is exacerbated if the interrupt source is in
> > sync with the tick, significantly skewing usage metrics towards system
> > time.
>
> ...
>
> > NOTE: This wasn't tested in depth, and it's mostly intended to highlight
> > the issue we're trying to solve. Also ccing KVM folks, since it's
> > relevant to guest CPU usage accounting.
>
> How bad is the synchronization issue on upstream kernels? We tried to address
> that in commit 160457140187 ("KVM: x86: Defer vtime accounting 'til after IRQ handling").
>
> I don't expect it to be foolproof, but it'd be good to know if there's a blatant
> flaw and/or easily closed hole.

The issue is not really about the interrupts themselves, but their side
effects.

For instance, let's say the guest sets up a Hyper-V stimer that
consistently fires 1 us before the preemption tick. The preemption tick
will expire while the vCPU thread is running with !PF_VCPU (maybe inside
kvm_hv_process_stimers(), for example). As long as they both stay in
sync, you'll get 100% system usage. I was able to reproduce this one
through kvm-unit-tests, but the race window is too small to keep the
interrupts in sync for long periods of time. It is still capable of
producing random system usage bursts, though (which is unacceptable for
some use-cases).

Other use-cases have bigger race windows and manage to maintain high
system CPU usage over long periods of time. For example, user-space
HPET emulation, or KVM+Xen (I don't know the fine details on these, but
VIRT_CPU_ACCOUNTING_GEN fixes the mis-accounting). It all comes down to
the same situation: something triggers an exit, and the vCPU thread goes
past 'vtime_account_guest_exit()' just in time for the tick interrupt to
show up.
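
To make the arithmetic concrete, here is a toy model of that situation.
It is only a sketch, not kernel code: simulate() and all of its
parameters are hypothetical, the 4000 us tick period assumes HZ=250, and
the 5 us of host-side work per exit is a made-up figure. It compares
what a tick sampler would charge to system time with the time actually
spent outside the guest.

```python
def simulate(period_us=4000, lead_us=1, host_us=5, ticks=1000):
    """Toy model of tick-sampled vs. exact accounting (all values assumed).

    The vCPU runs guest code all the time, except for a host-side window
    of host_us microseconds that opens lead_us before every tick.
    Returns (tick-sampled system share, exact system share).
    """
    sampled_system = 0   # tick periods charged to system by the sampler
    exact_system_us = 0  # microseconds really spent on the host side
    for k in range(1, ticks + 1):
        tick = k * period_us
        exit_at = tick - lead_us        # guest exits right before the tick
        reenter_at = exit_at + host_us  # host-side work done, guest resumes
        exact_system_us += host_us
        # The tick only sees the context it interrupts and charges the
        # whole period to it.
        if exit_at <= tick < reenter_at:
            sampled_system += 1
    return sampled_system / ticks, exact_system_us / (ticks * period_us)


if __name__ == "__main__":
    sampled, exact = simulate()
    print(f"tick-sampled system share: {sampled:.2%}")
    print(f"exact system share:        {exact:.4%}")
```

With these defaults every tick lands inside the host-side window, so the
sampler reports 100% system time while the real share is 5/4000, i.e.
0.125%. Moving the exit further away from the tick (say lead_us=10)
makes the sampled system share drop to zero, which is why the result is
so sensitive to how well the two sources stay in sync.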

Note that the kernel that reproduced these issues already includes
160457140187 ("KVM: x86: Defer vtime accounting 'til after IRQ
handling"). The RFC fix was tested against an upstream kernel by
tracing cputime accounting and making sure the right code-paths were
exercised.

Nicolas