[RFC] swsusp: brainstorming on a freaked-out approach (was: Re: [RFC/RFT] swsusp: image size tunable)

From: Christian Trefzer
Date: Sat Dec 17 2005 - 11:47:07 EST


Hello there,

On Fri, Dec 16, 2005 at 10:09:52PM +0100, Rafael J. Wysocki wrote:
> On Friday, 16 December 2005 20:29, Christian Trefzer wrote:
> > My point is, that the generic algorithm to determine the maximum image
> > size desired by the user will, as a whole, remain the same, whether
> > implemented in kernel or user space. Unfortunately, I am very unfamiliar
> > with swsusp code and all of its implications wrt. drivers etc. - this
> > still holds true after looking at swsusp_shrink_memory().
> >
> > But generally speaking: at some point, memory is freed for the image
> > creation process. Recent efforts towards tunable max_image_size make it
> > seem to me that we likely know beforehand, how much we are going to
> > free. Now say we have max_image_size of 500MB, but we know that active
> > swaps will hold 400MB at max. So we start out freeing enough memory for
> > a 400MB image, and once we're done, we notice we can only save 350MB.
> > Does anything keep us from reducing the image size to 350MB right now?
>
> Basically nothing and it could be implemented, but IMO it would be a step
> in a wrong direction.
>
> Namely, we want the image-writing code to go to the user space in the long
> run. Then, the whole suspend will be handled by a userspace process and the
> kernel will not know what kind of storage this process is using (it may or
> may not be a swap partition). Consequently, the kernel will not know how
> much space there is for the image, _unless_ the user space process tells it.

So far everything seems clear to me.

> The user space process can do this _once_, using /sys/power/image_size
> before it asks the kernel to create the image. Next, the kernel creates
> the image and swsusp_shrink_memory() is called in the process. Of course
> at this point the amount of available storage can change, but to know it
> the kernel would have to ask the userspace process if that happened.
> Generally speaking it can't do this, so it "returns" the image to the
> userspace process which should verify if the image is small enough
> and if not, repeat from writing /sys/power/image_size with a smaller
> number or just fail (if so configured).

I see now - when I was thinking about uswsusp, I wrongly guessed that
image creation will happen in the same region (user vs. kernel space) as
the checks for available storage. We can't really let image_size
converge towards free_storage with an upper limit of max_image_size
anyway, yet the silent retry with a smaller image size will eventually
be an option, although configurable via the userland utility. The latter
was what I intended, for checking available storage in every loop
iteration would indeed not be very efficient. Nice one!

Just a spontaneous idea: before we smash memory free with a sledge
hammer (no offense intended), would it be sensible to create sort of a
map of actual memory usage, determining the importance (kernel/user
apps/buffers) and relevance (frequently/recently/not used) of pages?

With a long-term target of a responsive post-resume system in mind, it
seems desirable to me to restore as much information as possible.
Putting pressure on the VM will (afaik) free buffers first, but swap out
applications later. If i am not misunderstood once more, that means
swsusp will mostly trash some cache, at least while freeing memory, yet
not push applications out to swaps. Obviously, the process of suspending
and resuming will require some memory by itself, and this amount will
grow the more functionality is added.

I hereby assume that the "underlying" functionality of reserving space
for suspend and resume routines, as well as "freezing everything" for
image creation and "restoring everything", basically "just works". That
means, from my naive point of view, that "time has stopped" wrt. the
"running" kernel and applications, leaving alive only the means to write
the image and swaps. Now it is "only" a matter of what to write to the
image, what to swap out, and what to forget.

Basically, importance of pages is pretty boolean - there are pages we
can throw away without harm, and there are those we can't. Then again,
some of the latter can be swapped if necessary, but a reluctance to do
so would improve system responsiveness post-resume. Unused application
data will eventually be swapped out already by the time we want to
suspend, yet with today's common memory size and current kernel
versions, this assumption becomes less and less valid. Therefore, I
assume again that application data is mostly paged in by the time
suspend is triggered.

So far my picture of the situation - any misunderstandings in there
might flaw the following suggestions.


My basic idea was to create the image incrementally, using a variable
slice size, in which important pages come before unimportant ones and
hot pages before cold ones. I bet the kernel/uswsusp interface for that
mechanism can become arbitrarily complex, but to keep it simple, would
it be possible to request one image slice at a time? Say, request and
write to the image all pages related to kernel space, then user apps,
then caches, ordered by relevance ("temperature").


* Looking at a case of extreme storage constraints:
If not even kernel and userland can be saved during susped, we fail.
If the image is stored anywhere but in swaps, we can try to swap out
pages [1] - once all the necessary areas are safe, we are as well, else
we fail. Caches can be forgotten if necessary, but are nice to have
- keep as much as possible/desired.


* In case of huge storage amounts for suspend:
Save the kernel first, then userland, then caches until we reach a
user-defined limit. The ordering of the latter two could be mixed
slice-wise according to page relevance, forcing some LRU/LFU
application pages out to swap in favour of keeping larger caches
[1].


* Anywhere inbetween / basic algorithm:
- Create a map of memory usage.
- Group data by importance: kernel | userland | buffers/caches
- Group equivalently relevant ("hot") pages in slices. Use
histogram-like maths to determine slice boundaries and stepping. [2]
- Determine the ordering of slices by weight, with a threshold of
importance vs. relevance. [3]
- Request and store one slice at a time:

- Slice is unswappable and thus must be stored in the suspend
image. Lack of storage implies suspend failure.
- Slice does not fit into suspend image, but is swappable [1].
Lack of swap space leads to suspend failure. Otherwise, we
instruct the kernel to swap out the pages contained within that
specific slice. The fact that these pages are swapped out must
be known to the VM before resuming the frozen kernel [4].
- Slice contains buffer data which can be silently dropped, yet
the fact that these buffers are gone must also make its way into
the VM state before we let the frozen kernel continue where it
has left off [4]



[1] Aspects of storing pages in swap:
A basic approach to guarantee integrity would be to make all
important pages backed by swap. This is contraproductive wrt. swap
as storage for images which can be compressed in the long run, so
that is a no-go.

It would also require swap to be at least as large as the amount of
memory currently occupied by applications, which won't necessarily
be the case, esp. if the suspend image will be written to a
dedicated partition or even an ordinary file. My conclusion is that
swapping out a lot of pages is only feasible in case the suspend
image is stored "out-of-swap".



[2] Thoughts on histograms:
Small slices induce more overhead, large slices are more likely to
exceed the maximum possible/desired image size. It therefore appears
sensible to me to stick to dynamically sized slices, where
horizontal size (amount of memory) and vertical size (relevance
range) should represent a "significant" histogram.

That said, we would end up having a suitable number of slices, with
each bearing a sane amount of data, esp. wrt. cache slices being
stored or dropped. I yet have to draw some draft charts and graphs
to visualize some relations between different sizes and
thresholds...



[3] Page weight / coloring:
Looking at [1], we can skip coloring of important pages if the image
is written to swaps as they will all need to be stored. In this
case, we can as well treat kernel space and userland as one slice
each, or even squash them together into one.

In case of external suspend image and buffer pages, we could color
pages according to their relevance, and after sorting them we have a
gradient from, say, red (hot) to blue (cold) pages. Using some sane
histograph techniques [2] we can now slice that up, ending up with,
well, slices we can treat "atomically".



[4] Changing VM state after freezing it:
This occured to me pretty late during this draft and is, of course,
crucial to all of it ; ) Murphy rules. So anyway, if we kind of
"tar" our memory up, one slice at a time, how can we tell the frozen
kernel's VM, "Hey, by the way, there were some pages of cache over
there, but we decided to drop them during suspend!" Even more
complex at first sight appears to me the treatment of pages swapped
out due to lack of suspend storage, external suspend image presumed.
This would require the already written kernel slices to be re-made.

After all, the one-slice-at-a-time approach is a great
simplification for illustrating the line of thought, but not for
writing a consistent image. The whole point of slicing the image up
is a calculation of what pages to store where (or at all), given an
arbitrary configuration wrt. memory, available/desired storage/swap
and throughput constraints. So instead of writing slices
immediately, we take everything into consideration to achieve an
optimum result, and once it is clear what shall be done, it is done.

The one big problem I see here is the eventual increase in memory
requirements to do all the maths, and potentially the need for free
space to create an atomic copy of some special pages. Also, "late
swapping" of slices stands in opposition to a totally frozen state
of the previously running environment. We therefore need to
determine which pages to swap out, order the VM system to do so
(without clearing out buffers), take into account all changes caused
by these actions and re-iterate/backtrack as few steps as necessary
to act appropriately. If we can rely on a partially frozen kernel to
only touch the pages it is told to swap out, the only further
changes should be visible within the page tables. Complexity results
only if such changes can shift slice boundaries, leaving at least
one question open to the VM gods: is this possible, and how
predictable is the resulting beaviour?



Well, so much for "quick" brainstorming on the issue... Don't bother
flaming me for any misunderstanding or misconception : )

> > Apologies for wasting your time, I am trying to get rid of my ignorance.
>
> It's not a waste of time at all. :-) IMO the issue is definitely worth
> discussing.

Thanks for clearing my conscience, yet I am afraid I just stained it
even more : )

Yours,
Chris

Attachment: pgp00000.pgp
Description: PGP signature