Re: [BUG REPORT] ktime_get_ts64 causes Hard Lockup

From: John Stultz
Date: Wed Jan 20 2016 - 12:59:39 EST


On Wed, Jan 20, 2016 at 9:42 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> On Wed, 20 Jan 2016, John Stultz wrote:
>> Ehrm. A more productive route in solving this might be to cap the
>> cycle delta we return from timekeeping_get_delta().
>>
>> We already do this in the CONFIG_DEBUG_TIMEKEEPING case, but adding a
>> simple check to the non-debug case should be doable w/o adding too
>> much overhead to this very hot path.
>>
>> Something like:
>>         if (delta > tkr->clock->max_cycles)
>>                 delta = tkr->clock->max_cycles;
>>
>>         return delta;
>
> Well, you can make CONFIG_KDB select CONFIG_DEBUG_TIMEKEEPING.

True. And turning on DEBUG_TIMEKEEPING is probably the easiest thing
for Jeff to try.

Though, there's still the same issue w/ paused VMs. Most of the design
for the timekeeping code has been that it can't properly function if
you block update_wall_time() calls, but it shouldn't kill the box.
With most clocksources, the issue is that the counter wraps and we lose
time. But in this case, with the TSC, it's the *very* large cycle delta
turning into an unexpectedly large nanosecond value.
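
To make the failure mode concrete, here is a minimal userspace sketch of
the capped delta and the cycles-to-nanoseconds conversion. It is not the
kernel code: struct clock_sketch and both helpers are hypothetical
stand-ins, and only the field names (mask, max_cycles, mult, shift) are
borrowed from the discussion above.

```c
#include <stdint.h>

/* Hypothetical stand-in for the clocksource fields discussed above. */
struct clock_sketch {
	uint64_t mask;		/* counter width mask */
	uint64_t max_cycles;	/* largest delta assumed safe to convert */
	uint32_t mult;		/* cycles -> ns multiplier */
	uint32_t shift;		/* cycles -> ns shift */
};

/* Cycles elapsed since the last update, capped at max_cycles. */
static uint64_t clamped_delta(const struct clock_sketch *c,
			      uint64_t now, uint64_t last)
{
	uint64_t delta = (now - last) & c->mask;

	if (delta > c->max_cycles)
		delta = c->max_cycles;

	return delta;
}

/*
 * Convert a cycle delta to nanoseconds. Without the cap above, a huge
 * delta can overflow the 64-bit multiply and yield a garbage value.
 */
static uint64_t cyc_to_ns(const struct clock_sketch *c, uint64_t delta)
{
	return (delta * c->mult) >> c->shift;
}
```

With the cap in place we still lose time after a long stall, but the
value handed to the conversion stays inside the range it was sized for.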

Hrm.. I do also wonder: the logarithmic accumulation chews through
large cycle deltas efficiently, but it does have some design limits,
so it might also hit the rails and take a while to spin accumulating
time with such large offsets.
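
A rough sketch of the idea, for anyone following along. This is a
simplified userspace model, not the kernel's implementation: the
function names are made up, and max_shift is a hypothetical stand-in for
whatever limit caps the chunk size. It consumes the offset in
power-of-two multiples of one interval and counts loop iterations, so
the effect of a small cap is visible.

```c
#include <stdint.h>

/* Position of the highest set bit; v must be non-zero. */
static unsigned int ilog2_u64(uint64_t v)
{
	unsigned int n = 0;

	while (v >>= 1)
		n++;

	return n;
}

/*
 * Consume `offset` cycles in chunks of (interval << shift), halving the
 * chunk whenever the remainder no longer fits. Returns the leftover
 * cycles (< interval). *intervals counts whole intervals accumulated,
 * *steps counts loop iterations. max_shift is a hypothetical cap.
 */
static uint64_t accumulate(uint64_t offset, uint64_t interval,
			   unsigned int max_shift,
			   uint64_t *intervals, unsigned int *steps)
{
	unsigned int shift;

	*intervals = 0;
	*steps = 0;
	if (interval == 0 || offset < interval)
		return offset;

	/* Start with the largest chunk that can fit into offset. */
	shift = ilog2_u64(offset) - ilog2_u64(interval);
	if (shift > max_shift)
		shift = max_shift;

	while (offset >= interval) {
		uint64_t chunk = interval << shift;

		if (offset >= chunk) {
			offset -= chunk;
			*intervals += 1ULL << shift;
		}
		if (offset < chunk && shift > 0)
			shift--;
		(*steps)++;
	}

	return offset;
}
```

In this model, 1000 cycles with a 10-cycle interval take 5 iterations
when the shift is uncapped, but 100 iterations with max_shift set to 0.
That is the kind of spin I'm worried about once the offset is far larger
than the cap allows for.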

Jeff: Can you try the config option above and let me know if that
avoids the issue? And if not, can you provide some analysis of what
else is going on?

thanks
-john