Re: [PATCH] tick: Detect and fix jiffies update stall

From: Frederic Weisbecker
Date: Tue Feb 01 2022 - 21:23:39 EST


On Tue, Feb 01, 2022 at 05:49:34PM -0800, Paul E. McKenney wrote:
> On Wed, Feb 02, 2022 at 01:01:07AM +0100, Frederic Weisbecker wrote:
> > In some rare cases, the timekeeper CPU may delay its jiffies
> > update duty for a while. Known causes include:
> >
> > * The timekeeper is waiting on stop_machine in a MULTI_STOP_DISABLE_IRQ
> > or MULTI_STOP_RUN state. Disabled interrupts prevent timekeeping
> > updates while waiting for the target CPU to complete its
> > stop_machine() callback.
> >
> > * The timekeeper vcpu has VMEXIT'ed for a long while due to some overload
> > on the host.
> >
> > Detect and fix these situations with emergency timekeeping catchups.
> >
> > Original-patch-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > Signed-off-by: Frederic Weisbecker <frederic@xxxxxxxxxx>
> > Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>
> Nice, thank you!
>
> So I should revert your earlier patch, apply this one, and then test
> the result?

No need to revert the nohz_full fix; this new one deals with non-dynticks
issues. This way we cover every timekeeper stall situation:

_ dynticks-idle is handled on IRQ entry

_ full dynticks is handled on IRQ entry in case of a timekeeping stall on
CPU 0 (the traditional nohz_full timekeeper). Let's hope we won't need to
handle syscalls and faults as well, but we'll see...

_ periodic ticks are now handled on the tick.

So you just need to apply this patch on your dev branch for testing.

Thanks!