Re: Early timeouts due to inaccurate jiffies during system suspend/resume

From: Thomas Gleixner
Date: Thu Apr 26 2018 - 17:40:24 EST


On Tue, 24 Apr 2018, Imre Deak wrote:
> On Mon, Apr 23, 2018 at 08:01:28PM +0300, Imre Deak wrote:
> > On Thu, Apr 19, 2018 at 01:05:39PM +0200, Thomas Gleixner wrote:
> > > On Thu, 19 Apr 2018, Imre Deak wrote:
> > > > Hi,
> > > >
> > > > while checking bug [1], I noticed that jiffies based timing loops like
> > > >
> > > > 	expire = jiffies + timeout + 1;
> > > > 	while (!time_after(jiffies, expire))
> > > > 		do_something;
> > > >
> > > > can last shorter than expected (that is less than timeout).
> > >
> > > Yes, that can happen when the timer interrupt is delayed long enough for
> > > whatever reason. If you need accurate timing then you need to use
> > > ktime_get().
> >
> > Thanks. I always regarded jiffies as non-accurate, but something that
> > gives a minimum time delay guarantee (when adjusted by +1 as above). I
> > wonder if there are other callers in kernel that don't expect an early
> > timeout.
>
> msleep and any other schedule_timeout based waits are also affected. At the
> same time for example msleep's documentation says:
> "msleep - sleep safely even with waitqueue interruptions".
>
> To me that suggests a wait with a minimum guaranteed delay.

Kinda :) The problem with jiffies is that it's a software maintained
counter which depends on interrupt delivery, in contrast to hardware based
counters, which just work (most of the time at least).
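
To illustrate, here is a rough user space toy model of the effect, not
kernel code. Every name in it (sim_clock, sim_tick, sim_advance, ...) is
made up for the example, and it assumes HZ=1000 and a tick handler which
simply catches jiffies up to the hardware time. It runs the loop from the
report once against the software counter and once against the hardware
counter, after the tick was held off for a while:

```c
#include <stdint.h>

#define TICK_NS 1000000ULL /* assumption: HZ=1000, one tick per ms */

/* Toy model only; all names are hypothetical. */
struct sim_clock {
	uint64_t hw_ns;   /* free-running hardware counter */
	uint64_t jiffies; /* software counter, only moved by the tick handler */
};

/* Tick handler: catch jiffies up to the hardware time. */
static void sim_tick(struct sim_clock *c)
{
	c->jiffies = c->hw_ns / TICK_NS;
}

/* Let whole ticks pass; the tick is only handled while irqs are on. */
static void sim_advance(struct sim_clock *c, uint64_t ticks, int irqs_on)
{
	while (ticks--) {
		c->hw_ns += TICK_NS;
		if (irqs_on)
			sim_tick(c);
	}
}

/* The loop from the report; returns the real time waited in ns. */
static uint64_t sim_wait_jiffies(struct sim_clock *c, uint64_t timeout)
{
	uint64_t start_ns = c->hw_ns;
	uint64_t expire = c->jiffies + timeout + 1;

	/* !time_after(jiffies, expire), ignoring wraparound */
	while (!(c->jiffies > expire))
		sim_advance(c, 1, 1);
	return c->hw_ns - start_ns;
}

/* Same wait against the hardware counter, in the spirit of ktime_get(). */
static uint64_t sim_wait_hw(struct sim_clock *c, uint64_t timeout)
{
	uint64_t start_ns = c->hw_ns;
	uint64_t deadline = start_ns + timeout * TICK_NS;

	while (c->hw_ns < deadline)
		sim_advance(c, 1, 1);
	return c->hw_ns - start_ns;
}

/* Real wait in ns when the wait starts after irq_off_ticks without ticks. */
static uint64_t sim_elapsed_ns(uint64_t irq_off_ticks, uint64_t timeout,
			       int use_hw)
{
	struct sim_clock c = { 0, 0 };

	sim_advance(&c, irq_off_ticks, 0); /* jiffies goes stale here */
	return use_hw ? sim_wait_hw(&c, timeout)
		      : sim_wait_jiffies(&c, timeout);
}
```

In this model, with a timeout of 3 ticks and jiffies being 5 ticks stale,
sim_elapsed_ns(5, 3, 0) comes out as a single tick, because the first
delivered tick pushes jiffies past expire in one go. The hardware counter
variant, sim_elapsed_ns(5, 3, 1), waits the full 3 ticks.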

> Ville had an idea to make the behavior more deterministic by clamping
> the jiffies increment to 1 for each timer interrupt. Would that work?

In theory, but there is the problem with NOHZ. NOHZ idle allows the CPU to
sleep for more than 1 jiffy in order to save power by not waking up just
to increment jiffies and go back to sleep. So we need to push jiffies
forward when the system was completely idle for some time. We already make
sure that jiffies are updated on interrupt entry from idle before any code
relying on them is run.
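
To put numbers on that, a small user space sketch with made-up names.
It assumes each handled interrupt knows how many tick periods passed
since the previous one (gaps[i]) and compares catching up against the
proposed clamping to one increment per interrupt:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: gaps[i] is the number of tick periods which passed
 * before interrupt i was handled. A long NOHZ idle period shows up as a
 * single large gap.
 */

/* Catch-up policy: jiffies is pushed forward by the full gap. */
static uint64_t jiffies_catch_up(const uint64_t *gaps, size_t n)
{
	uint64_t jiffies = 0;
	size_t i;

	for (i = 0; i < n; i++)
		jiffies += gaps[i];
	return jiffies;
}

/* Clamped policy: at most one increment per handled interrupt. */
static uint64_t jiffies_clamped(const uint64_t *gaps, size_t n)
{
	uint64_t jiffies = 0;
	size_t i;

	for (i = 0; i < n; i++)
		jiffies += gaps[i] ? 1 : 0;
	return jiffies;
}
```

For gaps of { 1, 1, 250, 1 }, that is a CPU which idled for 250 tick
periods once, the catch-up variant ends at 253 while the clamped one ends
at 4, so everything relying on jiffies would be 249 ticks behind.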

Now for the weird case where interrupts get delayed awfully long, the right
answer is to break up these long interrupt-disabled sections. Anything which
holds interrupts disabled longer than a couple of microseconds is broken.

Thanks,

tglx