Re: [BUG REPORT] ktime_get_ts64 causes Hard Lockup

From: Jeff Merkey
Date: Wed Jan 20 2016 - 12:16:34 EST


On 1/20/16, Jeff Merkey <linux.mdb@xxxxxxxxx> wrote:
> On 1/20/16, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>> On Tue, 19 Jan 2016, Jeff Merkey wrote:
>>> Nasty bug but trivial fix for this. What happens here is RAX (nsecs)
>>> gets set to a huge value (RAX = 0x17AE7F57C671EA7D) and passed through
>>
>> And how exactly does that happen?
>>
>> 0x17AE7F57C671EA7D = 1.70644e+18 nsec
>> = 1.70644e+09 sec
>> = 2.84407e+07 min
>> = 474011 hrs
>> = 19750.5 days
>> = 54.1109 years
>>
>> That's the real issue, not what you are trying to 'fix' in
>> timespec_add_ns()
>>
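
For reference, the conversion quoted above can be reproduced with a few
lines of C. This is only a sketch to check the arithmetic: the function
names are made up for illustration, and a year is assumed to be 365 days
because that is what matches the quoted figure.

```c
#include <stdint.h>
#include <stdio.h>

/* Seconds represented by a nanosecond count, as floating point. */
double ns_to_sec(uint64_t ns)
{
    return (double)ns / 1e9;
}

/*
 * Print the same breakdown as the quoted table.
 * Assumption: one year is 365 days (no leap-year correction).
 */
void print_breakdown(uint64_t ns)
{
    double sec = ns_to_sec(ns);

    printf("%g nsec\n", (double)ns);
    printf("= %g sec\n", sec);
    printf("= %g min\n", sec / 60.0);
    printf("= %g hrs\n", sec / 3600.0);
    printf("= %g days\n", sec / 86400.0);
    printf("= %g years\n", sec / 86400.0 / 365.0);
}
```

Calling print_breakdown(0x17AE7F57C671EA7DULL) prints the breakdown for
the RAX value reported in this thread.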

I guess I am going to have to become an expert on the timekeeper and
learn this subsystem backwards and forwards to code a touch function
to keep it from crashing the system.

On the 2.6 series kernels (and 2.2) this problem did not exist. I
noticed a lot of these changes came in during the late 2.6 cycles.
Before that time, I could leave the debugger spinning for days and
Linux worked fine.

For people who have to pay developers to write code on Linux, a
debugger is almost an essential tool, since it saves hundreds of
thousands of dollars in development costs. Not everyone wants to pay
their employees and engineers to sit around and code review every
problem - customers just want their problems fixed -- and fast. There
is no shortage of people who download and use this debugger, and I'm
certain kgdb is heavily used by folks doing development. If kernel
development is too hard, people move to something else based on simple
economics.

That being said, I need to get this fixed. There is no good reason a
debugger shouldn't be able to stop the system and leave it suspended
for days if necessary to run down a bug. I wrote a debugger on SMP
Netware that worked that way. The earliest versions of MDB worked
that way.

kgdb is broken right now because of this. I am not certain it affects
all systems out there, but it needs to be fixed.

If you have any ideas on how to code a touch function, please send me a
patch or suggest how it could be done unobtrusively. Otherwise I'll
have to dive into the timekeeper myself, learn yet another subsystem of
Linux, and fix its bugs. A subsystem that crashes because the timer
tick is skewed or returns garbage is poorly designed IMHO.

It should have either a touch function to keep it updated, or have the
ability to recover.

Jeff