Re: clocksource mutex deadlock, cat current_clocksource(2.6.33-rc6/7)

From: Andreas Mohr
Date: Mon Feb 08 2010 - 04:37:26 EST


On Mon, Feb 08, 2010 at 10:13:00AM +0100, Thomas Gleixner wrote:
> On Mon, 8 Feb 2010, Andreas Mohr wrote:
> >
> > And then a cat current_clocksource managed to hang again.
>
> Well, that's not surprising at all. If one task is stuck on clocksource_mutex,
> then the next one will be stuck as well.

I take it you are referring to the initial bootup acpi_pm lockup that the
NMI watchdog detected, and not assuming that I simply executed
cat current_clocksource twice, which my wording might have erroneously
hinted at.


So you'd think that we have a clocksource_mutex problem even before
the initial bootup switch to acpi_pm?

However I don't see how this could be the case, given that in some instances
boot does continue after acpi_pm selection, albeit after a delay.

> > (NOTE that the - now complete! - SysRq-T list does NOT show any backtraces
> > of kwatchdog any more, only many other processes)
> > Could it be that the (rather disruptive) NMI watchdog confuses the current state at
> > change_clocksource and causes that stuff to get left with
> > clocksource_mutex remaining taken?
>
> Nope, the NMI watchdog is not involved. It merely tells us that the
> task is stuck.

OK.
And after that message debug_locks is zeroed and kwatchdog is gone
from the process list (probably during debug_locks change).



I still can't make much sense of this behaviour.
If we have a problem during acpi_pm selection on boot, then by all
accounts it should get stuck completely (and yield the watchdog's lockup
message), not continue booting after some weird delay.
OK, this particular delay could still be explained by pretty severe
contention, but once the mutex has been acquired successfully, it
certainly shouldn't remain blocked even after bootup
(as witnessed by the cat current_clocksource issue).
After all, all mutex use is fully symmetric, so no surprises there IMHO,
unless NMI / SMI or the like are involved.



I'll explain what I think might be happening:
bootup switches to acpi_pm, timekeeping gets borked, NMI watchdog complains
due to timekeeping issues, brutally yanks the waiting acpi_pm switchover
(thereby NOT releasing clocksource_mutex),
and the result is that I have cat current_clocksource stuck in my userspace.
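
To make that hypothesis concrete, here is a rough userspace analogy in
Python. It is not kernel code and proves nothing about the actual
clocksource path. All names in it (interrupted_switchover,
reader_would_block, run_demo, demo_mutex) are made up for illustration.
It only shows the general effect: if a lock holder goes away without
unlocking, every later locker finds the lock still taken.

```python
import threading


def interrupted_switchover(mutex):
    """Hypothetical stand-in for the switchover path: it takes the lock
    and then ends without ever reaching the matching release()."""
    mutex.acquire()
    # Assumption being modelled: the holder is cut short right here.


def reader_would_block(mutex):
    """Non-blocking probe standing in for 'cat current_clocksource'.
    Returns True if a blocking acquire would hang."""
    got_it = mutex.acquire(blocking=False)
    if got_it:
        mutex.release()
    return not got_it


def run_demo():
    # Fresh lock per run, so repeated calls do not hang on a stale lock.
    demo_mutex = threading.Lock()
    worker = threading.Thread(target=interrupted_switchover,
                              args=(demo_mutex,))
    worker.start()
    worker.join()  # the holder is gone, the lock is still held
    return reader_would_block(demo_mutex)


if __name__ == "__main__":
    print("reader would hang:", run_demo())
```

The probe uses a non-blocking acquire so that the demo itself does not
hang. A real reader doing a blocking lock would simply sit there, which
matches what I see with cat current_clocksource.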

Andreas Mohr