Re: [RFC] mm: remove swapcache page early

From: Minchan Kim
Date: Wed Mar 27 2013 - 20:36:18 EST

In reply to: Hugh Dickins: "Re: [RFC] mm: remove swapcache page early"

Hi Hugh,

On Wed, Mar 27, 2013 at 02:41:07PM -0700, Hugh Dickins wrote:
> On Wed, 27 Mar 2013, Minchan Kim wrote:
>
> > Swap subsystem does lazy swap slot free with expecting the page
> > would be swapped out again so we can't avoid unnecessary write.
> so we can avoid unnecessary write.
> >
> > But the problem with in-memory swap is that it consumes memory space
> > until the vm_swap_full() condition (i.e., half of all swap is in use)
> > is met. That could be bad if we use multiple swap devices (a small
> > in-memory swap plus a big storage swap) or in-memory swap alone.
>
> That is a very good realization: it's surprising that none of us
> thought of it before - no disrespect to you, well done, thank you.
>
> And I guess swap readahead is utterly unhelpful in this case too.
>
> >
> > This patch changes the vm_swap_full() logic slightly so it can free
> > the swap slot early if the backing device is really fast.
> > For that I used SWP_SOLIDSTATE, but it might be controversial.
>
> But I strongly disagree with almost everything in your patch :)
> I disagree with addressing it in vm_swap_full(), I disagree that
> it can be addressed by device, I disagree that it has anything to
> do with SWP_SOLIDSTATE.
>
> This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
> is it? In those cases, a fixed amount of memory has been set aside
> for swap, and it works out just like with disk block devices. The

Brd is fine, but I think you are misunderstanding zram.
Zram doesn't reserve any memory up front; it allocates memory dynamically
when swap-out happens, so the same data can occupy space twice: once in
the pseudo block device and once in memory.
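
To illustrate the duplication, here is a minimal user-space sketch, not
kernel code. The struct and function names are hypothetical; only the
comparison in model_swap_full() is taken from the current vm_swap_full().
It ignores the VM_LOCKED/PageMlocked cases and shared swap entries, and
simply counts how many slots stay allocated after pages are swapped in
again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of one swap device; not kernel code. */
struct swap_model {
	long total_pages;	/* capacity of the swap device, in pages */
	long free_pages;	/* unused swap slots, like nr_swap_pages */
};

/* Same comparison as the current vm_swap_full(): more than half in use. */
static bool model_swap_full(const struct swap_model *s)
{
	return s->free_pages * 2 < s->total_pages;
}

/*
 * Swap out nr pages, then fault all of them back in. A slot is released
 * at swap-in only while the device counts as "full"; otherwise it stays
 * allocated. Returns the number of slots still held after the last
 * swap-in; on zram each of those is a compressed copy of a page that is
 * also resident in RAM.
 */
static long model_duplicates_after_swapin(struct swap_model *s, long nr)
{
	long dup = 0;

	assert(nr >= 0 && nr <= s->free_pages);
	s->free_pages -= nr;		/* swap-out takes nr slots */
	for (long i = 0; i < nr; i++) {
		if (model_swap_full(s))
			s->free_pages++;	/* slot freed at swap-in */
		else
			dup++;			/* lazy free: slot kept */
	}
	return dup;
}
```

With a 1000-page device, swapping 400 pages out and back in leaves all
400 slots held, because the device never counts as half full.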

> memory set aside may be wasted, but that is accepted upfront.
>
> Similarly, this is not a problem with swapping to SSD. There might
> or might not be other reasons for adjusting the vm_swap_full() logic
> for SSD or generally, but those have nothing to do with this issue.

Yes.

>
> The problem here is peculiar to frontswap, and the variably sized
> memory behind it, isn't it? We are accustomed to using swap to free

Zram, too.

> up memory by transferring its data to some other, cheaper but slower
> resource.
>
> But in the case of frontswap and zmem (I'll say that to avoid thinking

Frankly speaking, I couldn't understand what you mean by frontswap and zmem.
Frontswap is just a layer that hooks into the swap subsystem.
The real frontswap backends at the moment are zcache and zswap,
so I will read "zmem" as zcache and zswap. Okay?

> through which backends are actually involved), it is not a cheaper and
> slower resource, but the very same memory we are trying to save: swap
> is stolen from the memory under reclaim, so any duplication becomes
> counter-productive (if we ignore cpu compression/decompression costs:
> I have no idea how fair it is to do so, but anyone who chooses zmem
> is prepared to pay some cpu price for that).

Agree.

>
> And because it's a frontswap thing, we cannot decide this by device:
> frontswap may or may not stand in front of each device. There is no
> problem with swapcache duplicated on disk (until that area approaches
> being full or fragmented), but at the higher level we cannot see what
> is in zmem and what is on disk: we only want to free up the zmem dup.

That's exactly my concern, and it is why I asked for ideas.

>
> I believe the answer is for frontswap/zmem to invalidate the frontswap
> copy of the page (to free up the compressed memory when possible) and
> SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> (setting page dirty so nothing will later go to read it from the
> unfreed location on backing swap disk, which was never written).

Do you mean that zcache and zswap have to do garbage collection under some
policy? That could work, but what about zram? It is just a pseudo block
device and has no knowledge of what sits on top of it: that could be swap
or a normal block-device user. I mean, zram has no swap information with
which to handle this.

>
> We cannot rely on freeing the swap itself, because in general there
> may be multiple references to the swap, and we only satisfy the one
> which has faulted. It may or may not be a good idea to use rmap to
> locate the other places to insert pte in place of swap entry, to
> resolve them all at once; but we have chosen not to do so in the
> past, and there's no need for that, if the zmem gets invalidated
> and the swapcache page set dirty.

Yes, that could be better, but as I mentioned above, it can't handle the
zram case. If there is a solution for zram, I will be happy. :)
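
To check that I read the proposal correctly, here is a minimal user-space
sketch of the swap-in path as I understand it. It is not kernel code: the
structs and model_swapin() are hypothetical stand-ins for the page flags
and the frontswap backend. The swap entry itself stays allocated, since
other references to it may remain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for page flags; not the kernel's struct page. */
struct model_page {
	bool uptodate;
	bool swapcache;
	bool dirty;
};

/* Where the data for one swap entry currently lives. */
struct model_entry {
	size_t zmem_bytes;	/* compressed copy held in RAM, 0 if none */
	bool on_disk;		/* true if it was written to the backing device */
};

/*
 * Swap-in as I understand the proposal: when the data came from zmem,
 * drop the compressed copy and mark the page dirty, because the slot on
 * the backing device was never written. Returns the bytes of RAM freed.
 */
static size_t model_swapin(struct model_page *p, struct model_entry *e)
{
	size_t freed = e->zmem_bytes;

	p->uptodate = true;
	p->swapcache = true;
	if (freed) {
		e->zmem_bytes = 0;	/* invalidate the zmem copy */
		p->dirty = true;	/* force a real write at next swap-out */
	}
	return freed;
}
```

A page that came from the backing disk stays clean in this model, so the
on-disk copy can still be reused at the next swap-out.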

One more point: frontswap is already woven into the swap subsystem very
tightly, so I doubt that adding one more hook is really a problem.

Thanks for the great comments, Hugh!

>
> Hugh
>
> > So I am Ccing Shaohua and Hugh.
> > If it's a problem for SSD, I'd like to create a new type, SWP_INMEMORY
> > or something, for the z* family.
> >
> > Another problem is that zram is a block device, so it can set
> > SWP_INMEMORY or SWP_SOLIDSTATE easily (in fact, zram already does),
> > but I have no idea how to use it for frontswap.
> >
> > Any idea?
> >
> > Another possible optimization is to remove it unconditionally when we
> > find it is exclusive at swap-in.
> > It could help the frontswap family, too.
> > What do you think about it?
> >
> > Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> > Cc: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
> > Cc: Seth Jennings <sjenning@xxxxxxxxxxxxxxxxxx>
> > Cc: Nitin Gupta <ngupta@xxxxxxxxxx>
> > Cc: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>
> > Cc: Shaohua Li <shli@xxxxxxxxxx>
> > Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> > ---
> > include/linux/swap.h | 11 ++++++++---
> > mm/memory.c | 3 ++-
> > mm/swapfile.c | 11 +++++++----
> > mm/vmscan.c | 2 +-
> > 4 files changed, 18 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2818a12..1f4df66 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -359,9 +359,14 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t,
> > extern atomic_long_t nr_swap_pages;
> > extern long total_swap_pages;
> >
> > -/* Swap 50% full? Release swapcache more aggressively.. */
> > -static inline bool vm_swap_full(void)
> > +/*
> > + * Swap 50% full or fast backed device?
> > + * Release swapcache more aggressively.
> > + */
> > +static inline bool vm_swap_full(struct swap_info_struct *si)
> >  {
> > +	if (si->flags & SWP_SOLIDSTATE)
> > +		return true;
> >  	return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
> >  }
> >
> > @@ -405,7 +410,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
> > #define get_nr_swap_pages() 0L
> > #define total_swap_pages 0L
> > #define total_swapcache_pages() 0UL
> > -#define vm_swap_full() 0
> > +#define vm_swap_full(si) 0
> >
> > #define si_swapinfo(val) \
> > do { (val)->freeswap = (val)->totalswap = 0; } while (0)
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 705473a..1ca21a9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3084,7 +3084,8 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	mem_cgroup_commit_charge_swapin(page, ptr);
> >
> >  	swap_free(entry);
> > -	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> > +	if (likely(PageSwapCache(page)) && (vm_swap_full(page_swap_info(page))
> > +		|| (vma->vm_flags & VM_LOCKED) || PageMlocked(page)))
> >  		try_to_free_swap(page);
> >  	unlock_page(page);
> >  	if (page != swapcache) {
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 1bee6fa..f9cc701 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -293,7 +293,7 @@ checks:
> > scan_base = offset = si->lowest_bit;
> >
> > /* reuse swap entry of cache-only swap if not busy. */
> > - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> > + if (vm_swap_full(si) && si->swap_map[offset] == SWAP_HAS_CACHE) {
> > int swap_was_freed;
> > spin_unlock(&si->lock);
> > swap_was_freed = __try_to_reclaim_swap(si, offset);
> > @@ -382,7 +382,8 @@ scan:
> > spin_lock(&si->lock);
> > goto checks;
> > }
> > - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> > + if (vm_swap_full(si) &&
> > + si->swap_map[offset] == SWAP_HAS_CACHE) {
> > spin_lock(&si->lock);
> > goto checks;
> > }
> > @@ -397,7 +398,8 @@ scan:
> > spin_lock(&si->lock);
> > goto checks;
> > }
> > - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> > + if (vm_swap_full(si) &&
> > + si->swap_map[offset] == SWAP_HAS_CACHE) {
> > spin_lock(&si->lock);
> > goto checks;
> > }
> > @@ -763,7 +765,8 @@ int free_swap_and_cache(swp_entry_t entry)
> > * Also recheck PageSwapCache now page is locked (above).
> > */
> > if (PageSwapCache(page) && !PageWriteback(page) &&
> > - (!page_mapped(page) || vm_swap_full())) {
> > + (!page_mapped(page) ||
> > + vm_swap_full(page_swap_info(page)))) {
> > delete_from_swap_cache(page);
> > SetPageDirty(page);
> > }
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index df78d17..145c59c 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -933,7 +933,7 @@ cull_mlocked:
> >
> > activate_locked:
> > /* Not a candidate for swapping, so reclaim swap space. */
> > - if (PageSwapCache(page) && vm_swap_full())
> > + if (PageSwapCache(page) && vm_swap_full(page_swap_info(page)))
> > try_to_free_swap(page);
> > VM_BUG_ON(PageActive(page));
> > SetPageActive(page);
> > --
> > 1.8.2
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxxx For more info on Linux MM,
> see: http://www.linux-mm.org/ .

--
Kind regards,
Minchan Kim
