Re: [BUG REPORT] ktime_get_ts64 causes Hard Lockup

From: Thomas Gleixner
Date: Wed Jan 20 2016 - 09:28:06 EST


Jeff,

On Wed, 20 Jan 2016, Thomas Gleixner wrote:
> On Tue, 19 Jan 2016, Jeff Merkey wrote:
> > Nasty bug but trivial fix for this. What happens here is RAX (nsecs)
> > gets set to a huge value (RAX = 0x17AE7F57C671EA7D) and passed through
>
> And how exactly does that happen?
>
> 0x17AE7F57C671EA7D = 1.70644e+18 nsec
> = 1.70644e+09 sec
> = 2.84407e+07 min
> = 474011 hrs
> = 19750.5 days
> = 54.1109 years
>
> That's the real issue, not what you are trying to 'fix' in timespec_add_ns()

And that's caused by stopping the whole machine for 20 minutes. That violates
the assumption of the timekeeping core that the time between two updates of
the core is less than 5-10 minutes. So that insanely large number is caused
by a mult overrun when converting the time delta to nanoseconds.
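
To illustrate the overrun: the delta is converted with a multiply and a shift,
roughly ns = (cycles * mult) >> shift, and the product has to fit into 64 bits.
A minimal Python sketch of that arithmetic follows. The 2.5 GHz counter
frequency, the shift of 24 and the resulting mult are made-up values for
illustration, not the ones from the machine in question, and the sketch does
not try to reproduce the exact RAX value from the report.

```python
# Sketch of a 64-bit "multiply then shift" cycles-to-nanoseconds conversion.
# FREQ_HZ, SHIFT and MULT are hypothetical example values.

U64_MASK = (1 << 64) - 1

FREQ_HZ = 2_500_000_000                    # assumed counter frequency
SHIFT = 24                                 # assumed shift
MULT = (10**9 << SHIFT) // FREQ_HZ         # ns per cycle, scaled by 2**SHIFT


def cyc2ns_u64(cycles):
    """Convert with the product truncated to 64 bits, as a u64 multiply would."""
    return ((cycles * MULT) & U64_MASK) >> SHIFT


def cyc2ns_exact(cycles):
    """Same conversion with Python's unbounded integers, so nothing wraps."""
    return (cycles * MULT) >> SHIFT


def cycles_for(seconds):
    """Number of counter cycles elapsed in the given number of seconds."""
    return seconds * FREQ_HZ


# Largest delta whose product with MULT still fits into 64 bits.
max_safe_cycles = U64_MASK // MULT

for minutes in (5, 20):
    cycles = cycles_for(minutes * 60)
    state = "wrapped" if cycles > max_safe_cycles else "ok"
    print(minutes, "min:", state, cyc2ns_u64(cycles), cyc2ns_exact(cycles))
```

With these made-up numbers a 5 minute delta converts correctly, while a
20 minute delta overflows the 64-bit product and yields a bogus result.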

You can find that limit via:

# dmesg | grep tsc | grep max_idle_ns
[ 5.242683] clocksource tsc: mask: 0xffffffffffffffff max_cycles: 0x21139a22526, max_idle_ns: 440795252169 ns

So on that machine the limit is:

440795252169 nsec
440.795 sec
7.34659 min
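
The conversion above is easy to script. A small Python sketch, assuming the
dmesg line has the format shown above; the regular expression and the helper
name are mine and not part of any kernel tooling:

```python
import re

# Sample line copied from the dmesg output quoted above.
LINE = ("[ 5.242683] clocksource tsc: mask: 0xffffffffffffffff "
        "max_cycles: 0x21139a22526, max_idle_ns: 440795252169 ns")


def max_idle_ns(line):
    """Return max_idle_ns from a clocksource dmesg line, or None if absent."""
    m = re.search(r"max_idle_ns:\s*(\d+)\s*ns", line)
    return int(m.group(1)) if m else None


limit_ns = max_idle_ns(LINE)
limit_sec = limit_ns / 1e9
limit_min = limit_sec / 60
stop_min = 20  # how long the machine was stopped, in minutes

print(f"{limit_ns} nsec = {limit_sec:.3f} sec = {limit_min:.5f} min")
print("20 minute stop exceeds the limit:", stop_min > limit_min)
```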

And before you ask or come up with patches: No, we are not going to add
anything to the core timekeeping code to work around this limitation, simply
because it's going to add overhead to a performance-sensitive code path for
very limited value.

Keeping a machine stopped for 20 minutes will make a lot of other things
unhappy, so introducing a 'fix' for that particular issue is just silly.

Thanks,

tglx