Re: [RFC] cputime: Introduce option to force full dynticks accounting on NOHZ & NOHZ_IDLE CPUs

From: Sean Christopherson
Date: Wed Feb 21 2024 - 11:24:41 EST


On Tue, Feb 20, 2024, Nicolas Saenz Julienne wrote:
> Hi Sean,
>
> On Tue Feb 20, 2024 at 4:18 PM UTC, Sean Christopherson wrote:
> > On Mon, Feb 19, 2024, Nicolas Saenz Julienne wrote:
> > > Under certain extreme conditions, the tick-based cputime accounting may
> > > produce inaccurate data. For instance, guest CPU usage is sensitive to
> > > interrupts firing right before the tick's expiration.

Ah, this confused me. The "right before" is a bit misleading. It's more like
"shortly before", because if the interrupt that occurs due to the guest's tick
arrives _right_ before the host tick expires, then commit 160457140187 should
avoid horrific accounting.
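To make the failure mode concrete, here's a minimal userspace C sketch of tick-based accounting (illustrative only; `vcpu_acct`, `tick()` and `run_in_sync()` are made-up names, and the `in_guest` flag stands in for the kernel's PF_VCPU). The tick charges the entire slice to whichever context it samples at expiration, so an interrupt that kicks the vCPU out of guest mode shortly before every tick gets every slice charged to system time:

```c
#include <stdbool.h>

/* Illustrative sketch, not kernel code: the tick samples a single
 * flag at expiration and charges the whole slice to that context. */
struct vcpu_acct {
	bool in_guest;        /* stand-in for the kernel's PF_VCPU flag */
	unsigned long guest;  /* ticks charged as guest time */
	unsigned long system; /* ticks charged as system time */
};

/* Tick handler: charge the full slice to the sampled context. */
static void tick(struct vcpu_acct *a)
{
	if (a->in_guest)
		a->guest++;
	else
		a->system++;
}

/* Simulate N ticks where an interrupt exits the guest shortly
 * before each tick expires (the pathological in-sync case). */
static void run_in_sync(struct vcpu_acct *a, int ticks)
{
	for (int i = 0; i < ticks; i++) {
		a->in_guest = true;  /* vCPU spends most of the slice in guest */
		a->in_guest = false; /* guest timer exit just before the tick */
		tick(a);             /* whole slice charged as system time */
		a->in_guest = true;  /* back into the guest */
	}
}
```

Even though the simulated vCPU spends nearly the whole slice in guest mode, the flag is false at every sample, so the accounting reports 100% system time.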

> > > This forces the guest into kernel context, and has that time slice
> > > wrongly accounted as system time. This issue is exacerbated if the
> > > interrupt source is in sync with the tick,

It's worth calling out why this can happen, to make it clear that such
syncopation can arise quite naturally. E.g. something like:

interrupt source is in sync with the tick, e.g. if the guest's tick
is configured to run at the same frequency as the host tick, and the
    guest tick is ever so slightly ahead of the host tick.

> > > significantly skewing usage metrics towards system time.
> >
> > ...
> >
> > > NOTE: This wasn't tested in depth, and it's mostly intended to highlight
> > > the issue we're trying to solve. Also ccing KVM folks, since it's
> > > relevant to guest CPU usage accounting.
> >
> > How bad is the synchronization issue on upstream kernels? We tried to address
> > that in commit 160457140187 ("KVM: x86: Defer vtime accounting 'til after IRQ handling").
> >
> > I don't expect it to be foolproof, but it'd be good to know if there's a blatant
> > flaw and/or easily closed hole.
>
> The issue is not really about the interrupts themselves, but their side
> effects.
>
> For instance, let's say the guest sets up a Hyper-V stimer that
> consistently fires 1 us before the preemption tick. The preemption tick
> will expire while the vCPU thread is running with !PF_VCPU (maybe inside
> kvm_hv_process_stimers() for example). As long as they both keep in sync,
> you'll get 100% system usage. I was able to reproduce this one through
> kvm-unit-tests, but the race window is too small to keep the interrupts
> in sync for long periods of time, yet it's still capable of producing
> random system usage bursts (which is unacceptable for some use-cases).
>
> Other use-cases have bigger race windows and managed to maintain high
> system CPU usage over long periods of time. For example, with user-space
> HPET emulation, or KVM+Xen (don't know the fine details on these, but
> VIRT_CPU_ACCOUNTING_GEN fixes the mis-accounting). It all comes down to
> the same situation. Something triggers an exit, and the vCPU thread goes
> past 'vtime_account_guest_exit()' just in time for the tick interrupt to
> show up.
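The window described above can be sketched in userspace C as well (the function names mirror the kernel's flow, but the bodies are illustrative, not the actual implementation): the exit path clears the guest-context flag, then runs the slow timer emulation with IRQs enabled, so a host tick landing in that window charges the slice to system time.

```c
#include <stdbool.h>

/* Illustrative sketch of the race window: names echo the kernel's
 * flow, bodies are simplified stand-ins. */

static bool pf_vcpu; /* guest-context flag sampled by the tick */
static unsigned long guest_ticks, system_ticks;

/* Tick-based accounting sample. */
static void host_tick(void)
{
	if (pf_vcpu)
		guest_ticks++;
	else
		system_ticks++;
}

static void vtime_account_guest_exit(void)  { pf_vcpu = false; }
static void vtime_account_guest_enter(void) { pf_vcpu = true; }

/* One guest-timer-induced exit: if the host tick lands while the
 * (slow) timer emulation runs with IRQs enabled, the whole slice
 * is charged as system time. */
static void handle_stimer_exit(bool tick_lands_in_window)
{
	vtime_account_guest_exit();
	/* e.g. kvm_hv_process_stimers() runs here, IRQs enabled */
	if (tick_lands_in_window)
		host_tick();
	vtime_account_guest_enter();
}
```

If the guest timer and host tick stay in sync, every tick lands in that window and the reported usage is pure system time.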

I suspect the common "problem" with those flows is that emulating the guest timer
interrupt is (a) slow, relatively speaking and (b) done with interrupts enabled.

E.g. on VMX, the TSC deadline timer is emulated via VMX preemption timer, and both
the programming of the guest's TSC deadline timer and the handling of the expiration
interrupt is done in the VM-Exit fastpath with IRQs disabled. As a result, even
if the host tick interrupt is a hair behind the guest tick, it doesn't affect
accounting because the host tick interrupt will never be delivered while KVM is
emulating the guest's periodic tick.
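A sketch of why the IRQs-disabled fastpath sidesteps the skew (again illustrative userspace C, not KVM's actual code; `fastpath_timer_exit()` and friends are made-up names): a host tick that fires during the emulation is only delivered once IRQs are re-enabled, and with the deferred accounting from commit 160457140187 that delivery still happens in guest vtime context, so the slice is charged to the guest.

```c
#include <stdbool.h>

/* Illustrative sketch: guest-timer emulation runs with IRQs off,
 * so a host tick firing during it is latched and delivered only
 * afterwards, while vtime still says "guest". */

static bool pf_vcpu = true;
static bool irqs_disabled;
static bool tick_pending;
static unsigned long guest_ticks, system_ticks;

static void host_tick(void)
{
	if (irqs_disabled) { /* delivery deferred until IRQs on */
		tick_pending = true;
		return;
	}
	if (pf_vcpu)
		guest_ticks++;
	else
		system_ticks++;
}

static void local_irq_enable_sketch(void)
{
	irqs_disabled = false;
	if (tick_pending) { /* deliver the deferred tick now */
		tick_pending = false;
		host_tick();
	}
}

/* Fastpath exit: emulate the guest's deadline timer with IRQs off,
 * then re-enable IRQs before the (deferred) guest-exit accounting. */
static void fastpath_timer_exit(bool tick_fires_during_emulation)
{
	irqs_disabled = true;
	if (tick_fires_during_emulation)
		host_tick();           /* latched, not delivered */
	local_irq_enable_sketch();     /* tick lands with PF_VCPU still set */
	pf_vcpu = false;               /* deferred vtime_account_guest_exit() */
	pf_vcpu = true;                /* ...and re-entry to the guest */
}
```

With this ordering, even a perfectly in-sync host tick is charged as guest time, matching the behavior described for the VMX preemption timer fastpath.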

I'm guessing that if you tested on SVM (or with a guest that doesn't use the APIC
timer in deadline mode), neither of which utilizes the fastpath since KVM needs to
bounce through hrtimers, then you'd see similar accounting problems even without
using any of the problematic "slow" timer sources.