Re: [Bug #13058] First hibernation attempt fails

From: Rafael J. Wysocki
Date: Fri Apr 17 2009 - 16:35:24 EST


On Friday 17 April 2009, Linus Torvalds wrote:
>
> On Fri, 17 Apr 2009, Jens Axboe wrote:
> >
> > Given the somewhat odd nature of the bug and the requirements to trigger
> > it, how confident are you in the bisection results?
>
> I suspect it's timing-dependent.
>
> The failure case is an ENOMEM returned from the "echo disk > /sys/power/state",
> and sadly there are a _lot_ of potential sources of ENOMEMs in the path.
> And a number of them come from GFP_ATOMIC allocations etc.
>
> Now, that explains why it only happens while in X (more memory being
> used), and also why it succeeds the second time (the first try will have
> triggered VM activity and then free'd the pages it allocated up to that
> point).
>
> IOW, I bet it would work on the first try if you were to just run
> something like
>
> ptr = malloc(BIGNUM);
> memset(ptr, 0, BIGNUM);
> exit(0);
>
> first - just to make room for stuff.
>
> And the thing is, swsusp_save() really does do odd things. For example, to
> get rid of unnecessary memory, it does "drain_local_pages()", where the
> "local" is "local cpu". Why does it do that? Likely nobody knows.
>
> Now, that won't matter in Alan's case (he is UP), but the point is, the
> swsuspend code does these random things to try to free up memory, and I
> suspect it's mostly been a trial-and-error thing. And then subtle changes
> in memory usage when allocating or writing things out will change things.
>
> For example, there is a magic "PAGES_FOR_IO" #define, which is somewhat
> arbitrarily set to 4MB worth of pages. Where did that number come from?
> Who knows? But that's the number the code uses for the _initial_ check of
> "do we have enough memory" (the one that must have passed, since it
> actually started doing things and didn't print out a warning message).
>
> Anyway, from the dmesg, we can see:
>
> [ 41.873619] PM: Shrinking memory... Restarting tasks ... done.

Ah, thanks for pointing this out to me!

> and this is a clear indication that it's "swsusp_shrink_memory()" that
> failed. If it had succeeded, you'd have seen
>
> PM: Shrinking memory... done (xyz pages freed)
>
> but it returned an error case, and then the suspend fails and starts
> restarting tasks.

AFAICS, there's only one situation in which that can happen, namely when
shrink_all_memory() returns 0. The code assumes that this cannot happen
unless there _really_ is no memory left to free. Apparently, that has changed
recently and shrink_all_memory() can now return 0 even though there still is
some memory to free.
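
To make that failure path concrete, here is a rough userspace model of the
logic described above. It is only a sketch: shrink_until_fits() and the two
reclaim_*() helpers are hypothetical names invented for illustration, not
kernel functions, and the real swsusp_shrink_memory() does considerably more
accounting than this.

```c
#include <errno.h>

/* Hypothetical stand-in for the reclaim call; returns pages freed. */
typedef long (*reclaim_fn)(long nr_pages);

/*
 * Toy model of the shrink loop: keep reclaiming until the shortfall is
 * covered. A zero return from reclaim while pages are still needed is
 * treated as "nothing left to free" and reported as -ENOMEM.
 */
static long shrink_until_fits(long needed, reclaim_fn reclaim, long *freed)
{
	*freed = 0;
	while (needed > 0) {
		long got = reclaim(needed);

		if (got == 0)
			return -ENOMEM;	/* the assumption under discussion */
		*freed += got;
		needed -= got;
	}
	return 0;
}

/* Reclaimer that always makes progress, at most 32 pages per call. */
static long reclaim_steady(long nr_pages)
{
	return nr_pages < 32 ? nr_pages : 32;
}

/* Reclaimer that reports 0 even though memory could still be freed. */
static long reclaim_spurious_zero(long nr_pages)
{
	(void)nr_pages;
	return 0;
}
```

With reclaim_steady() the model covers the shortfall and returns 0; with
reclaim_spurious_zero() it gives up with -ENOMEM on the first call, which is
the behavior the "Restarting tasks" line in the dmesg points at.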

At the moment I don't see what change caused that to happen, but shouldn't we
put .nr_reclaimed = 0 in the definition of sc in shrink_all_memory()?
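
One C detail is relevant to that question. This message does not show how sc
is defined, so the following assumes it is declared with a designated
initializer; under that assumption, members that are not named are already
zero-initialized by the language, and an explicit .nr_reclaimed = 0 would
document intent without changing behavior. The struct and function below are
hypothetical, cut-down stand-ins, not the kernel's definitions.

```c
/* Hypothetical, cut-down stand-in for the kernel's struct scan_control. */
struct scan_control_model {
	unsigned long nr_reclaimed;
	int may_swap;
	int may_writepage;
};

/* Builds the struct using a designated initializer that omits nr_reclaimed. */
static struct scan_control_model make_sc(void)
{
	struct scan_control_model sc = {
		.may_writepage = 1,
		/* .nr_reclaimed is not named, so C initializes it to 0 */
	};

	return sc;
}
```

If sc were instead declared without any initializer, nr_reclaimed would start
out indeterminate and setting it explicitly would matter.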

Rafael