Re: 2.6.23-rc6-mm1: failure to boot on HP nx6325, no sound when booted, USB-related WARNING

From: Rafael J. Wysocki
Date: Thu Sep 20 2007 - 17:32:28 EST


Thomas,

On Thursday, 20 September 2007 23:08, Thomas Gleixner wrote:
> Rafael,
>
> On Thu, 2007-09-20 at 22:39 +0200, Rafael J. Wysocki wrote:
> > > Works as well. What's the difference between this and the real thing ?
> >
> > The real thing also calls device_power_down(PMSG_FREEZE), which is a
> > counterpart of sysdev_shutdown(), more or less, and I think that's what goes
> > belly up.
> >
> > You can use the patch below (on top of -rc6-mm1), which just disables the image
> > creation (that should be irrelevant anyway) and see what happens.
>
> In meantime I figured out what's happening. The ordering in
> hibernate_snapshot() is wrong. It does:
>
> swsusp_shrink_memory();
> suspend_console();
> device_suspend(PMSG_FREEZE);
> platform_prepare(platform_mode);
>
> disable_nonboot_cpus();
>
> swsusp_suspend();
>
> enable_nonboot_cpus();
>
> platform_finish(platform_mode);
> device_resume();
> resume_console();
>
> We disable everything in device_suspend()

No, we don't. sysdevs are _not_ suspended in device_suspend().
They are suspended in device_power_down(), which is called
_after_ disable_nonboot_cpus() (from swsusp_suspend()).

> including timekeeping,

No, the timekeeping is suspended in device_power_down() (or at least it should
be).

> so any code which is depending on working timekeeping and timer
> functionality (which is suspended in timekeeping_suspend() as well) is
> busted.
>
> enable_nonboot_cpus() definitely relies on working timekeeping and
> timers depending on the codepath. It's just a surprise that this did not
> blow up earlier (also before clock events).
>
> I changed the ordering of the above to:
>
> disable_nonboot_cpus();
>
> swsusp_shrink_memory();
> suspend_console();
> device_suspend(PMSG_FREEZE);
> platform_prepare(platform_mode);
> swsusp_suspend();
> platform_finish(platform_mode);
> device_resume();
> resume_console();
>
> enable_nonboot_cpus();

Actually, we can't do this here, because of ACPI and some interrupt handling
related problems. Unfortunately, platform_finish() needs to go _after_
enable_nonboot_cpus() and device_resume() needs to go after platform_finish().
Analogously, disable_nonboot_cpus() has to go after platform_prepare().

Otherwise, some systems will break.

> and non-surprisingly the "my VAIO needs help from keyboard" problem went
> away immediately. See patch below. (on top of rc7-hrt1, -mm1 does not
> work at all on my VAIO due to some yet not identified wreckage)

Hm, I really don't know why it helps, but that's not because of the timekeeping
suspend, IMO.

> I did not yet look into the suspend to ram code, but I guess that there
> is an equivalent problem.

Yes, the code ordering is the same, but it's not totally wrong, IMHO.

> But I have no idea why this affects Andrews jinxed VAIO (UP machine),
> though I suspect that we have more timekeeping/timer depending code
> somewhere waiting to bite us.

That's possible.

> Also I still need to debug why the HIBERNATION_TEST code path (which has
> a msleep(5000) in it) does not fail,

See above. :-)

> but I postpone this until tomorrow morning. I'm dead tired after hunting
> this Heisenbug which changes with every other printk added to the code.
> I'm going to add some really noisy messages for everything which accesses
> timekeeping / timers _after_ those systems have been shut down.
>
> We really need to fix this once and forever _before_ 2.6.23 final, even
> if it requires a -rc8.

Agreed.

Greetings,
Rafael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/