Re: Regression: ONE CPU fails bootup at Re: [3.2.0-RC7] BUG: unableto handle kernel NULL pointer dereference at 0000000000000598 1.478005]IP: [<ffffffff8107a6c4>] queue_work_on+0x4/0x30

From: Stefan Bader
Date: Wed Jan 04 2012 - 10:12:10 EST


This is a multi-part message in MIME format.-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 04.01.2012 02:20, NeilBrown wrote:
> On Tue, 03 Jan 2012 16:53:00 -0800 John Stultz <john.stultz@xxxxxxxxxx>
> wrote:
>
>> On Wed, 2012-01-04 at 11:31 +1100, NeilBrown wrote:
>>> On Tue, 03 Jan 2012 15:09:48 -0800 John Stultz <john.stultz@xxxxxxxxxx>
>>> wrote:
>>>>> From the stack trace, we've kicked off a rtc_timer_do_work,
>>>>> probably
>>>> from the rtc_initialize_alarm() schedule_work call added in Neil's
>>>> patch. From there, we call __rtc_set_alarm -> cmos_set_alarm ->
>>>> cmos_rq_disable -> cmos_checkintr -> rtc_update_irq ->
>>>> schedule_work.
>>>>
>>>> So, what it looks to me is that in cmos_checkintr, we grab the
>>>> cmos->rtc and pass that along. Unfortunately, since the cmos->rtc
>>>> value isn't set until after rtc_device_register() returns its null at
>>>> that point. So your patch isn't really fixing the issue, but just
>>>> reducing the race window for the second cpu to schedule the work.
>>>>
>>>> Sigh. I'd guess dropping the schedule_work call from
>>>> rtc_initialize_alarm() is the right approach (see below). When
>>>> reviewing Neil's patch it seemed like a good idea there, but it seems
>>>> off to me now.
>>>>
>>>> Neil, any thoughts on the following? Can you expand on the condition
>>>> you were worried about in around that call?
>>>
>>> If you set an alarm in the future, then shutdown and boot again after
>>> that time, then you will end up with a timer_queue node which is in the
>>> past.
>>
>> Thanks for explaining this again.
>>
>> Hrm. It seems the easy answer is to simply not add alarms that are in the
>> past. Further, I'm a bit perplexed, as if they are in the past, the
>> enabled flag shouldn't be set. __rtc_read_alarm() does check the
>> current time, so maybe we can make sure we don't return old values? I
>> guess I assumed __rtc_read_alarm() avoided returning stale values, but
>> apparently not.
>
> That would probably be a more robust approach. Also it might make sense
> to clean out old alarms whenever we are about to add a new one.
>
>>
>>> When this happens the queue gets stuck. That entry-in-the-past won't
>>> get removed until and interrupt happens and an interrupt won't happen
>>> because the RTC only triggers an interrupt when the alarm is "now".
>>>
>>> So you'll find that e.g. "hwclock" will always tell you that 'select'
>>> timed out.
>>>
>>> So we force the interrupt work to happen at the start just in case.
>>
>> Unfortunately its too early.
>>
>>> Did you see my proposed patch which converted those calls to do the
>>> work in-process rather than passing it to a worker-thread? I think
>>> that is a clean fix.
>>
>> I don't think I saw it today. Was it from before the holidays?
>
> About 4 hours ago: Subject: Re: Patch Upstream: rtc: Expire alarms after
> the time is set.
>
>>
>> Even so, at this point, I don't know if we have enough time for testing,
>> so I'm thinking we either just drop the problematic sched_work call or
>> revert the whole thing and try again for 3.3
>
> I wouldn't object to that. The bug only triggers in unusual circumstances
> and is quite easy to work around so it is safer to wait until we have a
> really good fix.
>
> Thanks, NeilBrown
>
Neil, this variation would also keep the kernel from crashing in my test
environment. What do you think of that?

- -Stefan

>
>>
>> thanks -john
>>
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBCgAGBQJPBGw4AAoJEOhnXe7L7s6jg5YQAJ1PrNGhuXJSAevurIorc+l5
Xbyj96MRk2Z7y7c0f/OQCCKzmReNMOQ91pgok2FIamHBh55JtbgdldyTWG80Jr0M
N1jwkus1SJTIcLAeIrzoXSLU5uTvmcCcW0GBUuK98m0LDwMMJr0P1a5ilsyorRFY
ekdFIL732Cs6hn3BAlg6zVucG/kMOi3OEOtHlbKdw83VdxJ1OmjIkW9tlj/6268j
ZImuXh1XZmCJiTuCwdvdi3RVU3qZvQJi1AW0G0zsQcc2r13OQkSSoi/Wo5+6dNit
7WxSbYqeZ4hPmk67yDKjOK67kDU094end++ePtQNke0d9IePB3fRvDIi+eEkrjc5
0rJi2gLkpSyE8MJ3pBO9Bt8JCVjpY1QZnQsDT68fIHUApMnrwRK2N371JzbSim52
8VYgTfubRRdwiLlMAL7ACilFV2FKc9PcTf7FrrR2j+GZfhKTD5iy+bNvqpWaJVVT
jUrfCS79z+Z9v9/bjZ55n23gIJQn3jLk3B79plnFlwmxp9CDaAITmfzKsSB0EoZB
RF0ndDkd1Oii/XuFdqaVaMeZojPCT2rVHas2H26mH+Ihb0n9EvSBgx2L64DbRrAt
duLngf+IAO6T9ngaqnhmHwJXdwKiKGxUhWVdo92YTd16jdANd/711/B0kzHHWw4H
FT46+Os2UQJR+HsedKZg
=Kgpc
-----END PGP SIGNATURE-----