RE: [PATCH 7/8] zswap: add to mm/

From: Dan Magenheimer
Date: Fri Jan 04 2013 - 13:47:40 EST


> From: Dave Chinner [mailto:david@xxxxxxxxxxxxx]
> Subject: Re: [PATCH 7/8] zswap: add to mm/

Hi Dave --

Thanks for your continued helpful feedback and expertise!

> > Given the above, do you think either compressed-anonymous-pages or
> > compressed-pagecache-pages are suitable candidates for the shrinker
> > infrastructure?
>
> I don't know all the details of what you are trying to do, but you
> seem to be describing a two-level hierarchy - a pool of compressed
> data and a pool of uncompressed data, and under memory pressure are
> migrating data from the uncompressed pool to the compressed pool. On
> access, you are migrating back the other way. Hence it seems to me
> that you could implement the process of migration from the
> uncompressed pool to the compressed pool as a shrinker so that it
> only happens as a result of memory pressure....

I suppose that would be an option, but the current triggers
for compression are: (for anonymous pages) the decision by
the MM subsystem to swap-out a specific page; and (for
pagecache pages) the decision by the MM subsystem to reclaim
a specific pagecache page. This is all handled by the cleancache
and frontswap APIs/hooks that Linus merged in 3.0 and 3.5, respectively.

This approach leverages all the existing MM mechanisms to
ensure that all existing memory pressure valves are honored
unchanged, and also ensures that MM has selected the lowest
priority pages (and thus presumably the pages least likely
to be directly addressed soon).

You're correct that the normal trigger for decompression is
access, but this is handled through frontswap/cleancache
hooks in the existing pagefault paths. So this also honors
all existing memory pressure mechanisms.

So, it is the "abnormal" decompression triggers that we are
mostly exploring here: For anonymous pages, we reach a point
where zcache/zswap is "full" and we wish we had used
the swap disk for the LRU pages... so we need to decompress
some pages and move them to the "real" swap device. And for
pagecache pages, we somehow determine that we need to throw
away some zpages, and we'd like to throw away as few zpages
as possible (preferably in some kind of LRU order), while
freeing up as many wholepages as possible.

This last is the only current (feebly attempted) use of the
shrinker API.
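
To make the pagecache trade-off above concrete (throw away few zpages,
free many wholepages), here is a minimal userspace sketch. It assumes a
pageframe can only be freed once every zpage packed into it has been
discarded, and it deliberately ignores LRU order, so it shows only the
"fewest zpages per wholepage" half of the problem. The names
zpages_evicted and cmp_count are hypothetical, not zcache/zswap code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Order pageframes by how many zpages they hold, emptiest first. */
static int cmp_count(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

/*
 * counts[i] is the number of zpages packed into pageframe i
 * (hypothetical bookkeeping; the array is sorted in place).
 * Returns how many zpages must be thrown away to free `want`
 * wholepages when the emptiest pageframes are chosen first.
 * If want exceeds n, all n pageframes are freed.
 */
static unsigned zpages_evicted(unsigned *counts, size_t n, size_t want)
{
    unsigned evicted = 0;
    size_t i;

    qsort(counts, n, sizeof(*counts), cmp_count);
    for (i = 0; i < n && i < want; i++)
        evicted += counts[i];
    return evicted;
}
```

With pageframes holding 3, 1, 2 and 1 zpages, freeing two wholepages
this way costs two zpages; picking in strict LRU order could cost up
to five. A real shrinker would have to balance the two.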

> > Note that compressed anonymous pages are always dirty so
> > cannot be "reclaimed" as such. But the mechanism that Seth
> > and I are working on causes compressed anonymous pages to
> > be decompressed and then sent to backing store, which does
> > (eventually, after I/O latency) free up pageframes.
>
> The lack of knowledge I have about zcache/zswap means I might be
> saying something stupid, but why wouldn't you simply write the
> uncompressed page to the backing store and then compress it on IO
> completion? If you have to uncompress it for the application to
> either modify the page again or write it to the backing store,
> doesn't it make things much simpler if the cache only holds clean
> pages? And if it only holds clean pages, then another shrinker could
> be used to keep the size of it in check....

A good point, and this is actually already implemented as an option.
(See frontswap_writethrough_enabled.) But it has the unfortunate
side effect of generating a lot of swap-disk write traffic that,
in many circumstances, could have been completely avoided.
For some reason, performance also sucked... though that was
never investigated, so it may have been some silly bug, and we should
revisit it.

> > In your opinion,
> > then, should they be managed by core MM, or by shrinker-controlled
> > caches, by some combination, or independently of either?
>
> I think the entire MM could be run by the shrinker based reclaim
> infrastructure. You should probably have a read of the discussions
> in this thread to get an idea of where we are trying to get to with
> the shrinker infrastructure:
>
> https://lkml.org/lkml/2012/11/27/567
>
> (Warning: I don't say very nice things about the zcache/ramster
> shrinkers in that patch series. :/ )

Heh. No offense taken. I hope your brain has recovered and you
managed to avoid tearing out your eyeballs. That code was definitely
not ready for primetime and not really even ready for staging,
but had to be published due to various unfortunate circumstances.

If you have suggestions for other improvements (in addition
to your broader patchset), we would be eager for your help!

> > Can slab today suitably manage "larger" objects that exceed
> > half-PAGESIZE? Or "larger" objects, such as 35%-PAGESIZE where
> > there would be a great deal of fragmentation?
>
> Have a look at how the kernel heap is implemented:
>
> <snip>
>
> i.e. it's implemented as a bunch of power-of-2 sized slab caches,
> with object sizes that range up to 4MB. IIRC, SLUB is better suited
> to odd sized objects than SLAB due to its ability to have multiple
> pages per slab even for objects smaller than page sized......

Hmmm... I was unclear. *All* objects (aka zpages) stored by zcache/zswap
are less than PAGESIZE, and a large percent are between PAGESIZE/2
and PAGESIZE, and a large percent are between PAGESIZE/3 and PAGESIZE/2.
I don't believe slab (or slub or kmalloc) can handle these efficiently
without significant fragmentation, though it may be my poor understanding
of slab/slub.
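
To put rough numbers on that concern, here is a minimal sketch. It
assumes a 4096-byte PAGESIZE and purely power-of-2 size classes with
an 8-byte minimum, which is a simplification of the real kmalloc
caches; pow2_class and pow2_waste are made-up helper names, not
kernel functions.

```c
#include <stddef.h>

/* Smallest power-of-2 size class that fits `size` bytes
 * (assumed minimum class: 8 bytes). */
static size_t pow2_class(size_t size)
{
    size_t c = 8;

    while (c < size)
        c <<= 1;
    return c;
}

/* Bytes lost when a `size`-byte zpage lands in its power-of-2 class. */
static size_t pow2_waste(size_t size)
{
    return pow2_class(size) - size;
}
```

Under those assumptions, a 1434-byte zpage (about 35% of PAGESIZE)
lands in the 2048-byte class and wastes 614 bytes, and any zpage even
one byte over PAGESIZE/2 lands in the 4096-byte class, so compressing
it saves no memory at all.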

> > If so, we should definitely consider slab as an alternative
> > for zpage allocation.
>
> Or you could just use kmalloc... ;)
>
> As I said initially - don't think of whether you need to use slab
> allocation or otherwise. Start with simple allocation, a tracking
> mechanism and a rudimentary shrinker, and then optimise allocation and
> reclaim once you understand the limitations of the simple
> solution....

Indeed. I think that's where we are at... optimising the reclaim
now that we understand the limitations of the rudimentary (eye-clawing-out)
shrinker. Please help if you have ideas!

Dan