Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead

From: Hugh Dickins
Date: Wed Nov 14 2012 - 04:48:18 EST


On Tue, 6 Nov 2012, Shaohua Li wrote:
> On Wed, Oct 24, 2012 at 09:13:56AM +0800, Shaohua Li wrote:
> > On Tue, Oct 23, 2012 at 09:41:00AM -0400, Rik van Riel wrote:
> > > On 10/23/2012 01:51 AM, Shaohua Li wrote:
> > >
> > > >I have no strong point against the global state method. But I'd agree making the
> > > >heuristic simple is preferred currently. I'm happy about the patch if the '+1'
> > > >is removed.
> > >
> > > Without the +1, how will you figure out when to re-enable readahead?
> >
> > Below code in swapin_nr_pages can recover it.
> > + if (offset == prev_offset + 1 || offset == prev_offset - 1)
> > + pages <<= 1;
> >
> > Not perfect, but should work in some sort. This reminds me to think if
> > pagereadahead flag is really required, hit in swap cache is a more reliable way
> > to count readahead hit, and as Hugh mentioned, swap isn't vma bound.
>
> Hugh,
> ping! Any chance you can check this again?

I apologize, Shaohua, my slowness must be very frustrating for you,
as it is for me too.

Thank you for pointing out how my first patch was reading two pages
instead of one in the random case, explaining its disappointing
performance there: odd how blind I was to that, despite taking stats.

I did experiment with removing the "+ 1" as you did, it worked well
in the random SSD case, but degraded performance in (all? I forget)
the other cases.

I failed to rescue the "algorithm" in that patch, and changed it a
week ago for an even simpler one, that has worked well for me so far.
When I sent you a quick private ack to your ping, I was puzzled by its
"too good to be true" initial test results: once I looked into those,
found my rearrangement of the test script had left out a swapoff,
so the supposed harddisk tests were actually swapping to SSD.

I've finally got around to assembling the results and writing up
some description, starting off from yours. I think I've gone as
far as I can with this, and don't want to hold you up with further
delays: would it be okay if I simply hand this patch over to you now,
to test and expand upon and add your Sign-off and send in to akpm to
replace your original in mmotm - IF you are satisfied with it?


[PATCH] swap: add a simple detector for inappropriate swapin readahead

swapin readahead does a blind readahead, whether or not the swapin
is sequential. This may be ok on harddisk, because large reads have
relatively small costs, and if the readahead pages are unneeded they
can be reclaimed easily - though, what if their allocation forced
reclaim of useful pages? But on SSD devices large reads are more
expensive than small ones: if the readahead pages are unneeded,
reading them in caused significant overhead.

This patch adds very simplistic random read detection. Stealing
the PageReadahead technique from Konstantin Khlebnikov's patch,
avoiding the vma/anon_vma sophistications of Shaohua Li's patch,
swapin_nr_pages() simply looks at readahead's current success
rate, and narrows or widens its readahead window accordingly.
There is little science to its heuristic: it's about as stupid
as can be whilst remaining effective.

The table below shows elapsed times (in centiseconds) when running
a single repetitive swapping load across a 1000MB mapping in 900MB
ram with 1GB swap (the harddisk tests had taken painfully too long
when I used mem=500M, but SSD shows similar results for that).

Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes
his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1
patch which Shaohua showed to be defective; HughNew this Nov 14
patch, with page_cluster as usual at default of 3 (8-page reads);
HughPC4 this same patch with page_cluster 4 (16-page reads);
HughPC0 with page_cluster 0 (1-page reads: no readahead).

HDD for swapping to harddisk, SSD for swapping to VertexII SSD.
Seq for sequential access to the mapping, cycling five times around;
Rand for the same number of random touches. Anon for a MAP_PRIVATE
anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

One weakness of Shaohua's vma/anon_vma approach was that it did
not optimize Shmem: seen below. Konstantin's approach was perhaps
mistuned, 50% slower on Seq: did not compete and is not shown below.

HDD Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon 73921 76210 75611 76904 78191 121542
Seq Shmem 73601 73176 73855 72947 74543 118322
Rand Anon 895392 831243 871569 845197 846496 841680
Rand Shmem 1058375 1053486 827935 764955 764376 756489

SSD Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon 24634 24198 24673 25107 21614 70018
Seq Shmem 24959 24932 25052 25703 22030 69678
Rand Anon 43014 26146 28075 25989 26935 25901
Rand Shmem 45349 45215 28249 24268 24138 24332

These tests are, of course, two extremes of a very simple case:
under heavier mixed loads I've not yet observed any consistent
improvement or degradation, and wider testing would be welcome.

Original-patch-by: Shaohua Li <shli@xxxxxxxxxxxx>
Original-patch-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxx>
Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Wu Fengguang <fengguang.wu@xxxxxxxxx>
Cc: Minchan Kim <minchan@xxxxxxxxxx>
---

include/linux/page-flags.h | 4 +-
mm/swap_state.c | 55 +++++++++++++++++++++++++++++++++--
2 files changed, 54 insertions(+), 5 deletions(-)

--- 3.7-rc5/include/linux/page-flags.h 2012-09-30 16:47:46.000000000 -0700
+++ linux/include/linux/page-flags.h 2012-11-11 09:45:30.908591576 -0800
@@ -228,9 +228,9 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
PAGEFLAG(MappedToDisk, mappedtodisk)

-/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
+/* PG_readahead is only used for reads; PG_reclaim is only for writes */
PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
-PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */
+PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)

#ifdef CONFIG_HIGHMEM
/*
--- 3.7-rc5/mm/swap_state.c 2012-09-30 16:47:46.000000000 -0700
+++ linux/mm/swap_state.c 2012-11-11 09:50:06.520598126 -0800
@@ -53,6 +53,8 @@ static struct {
unsigned long find_total;
} swap_cache_info;

+static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
+
void show_swap_cache_info(void)
{
printk("%lu pages in swap cache\n", total_swapcache_pages);
@@ -265,8 +267,11 @@ struct page * lookup_swap_cache(swp_entr

page = find_get_page(&swapper_space, entry.val);

- if (page)
+ if (page) {
INC_CACHE_INFO(find_success);
+ if (TestClearPageReadahead(page))
+ atomic_inc(&swapin_readahead_hits);
+ }

INC_CACHE_INFO(find_total);
return page;
@@ -351,6 +356,42 @@ struct page *read_swap_cache_async(swp_e
return found_page;
}

+unsigned long swapin_nr_pages(unsigned long offset)
+{
+ static unsigned long prev_offset;
+ unsigned int pages, max_pages;
+
+ max_pages = 1 << ACCESS_ONCE(page_cluster);
+ if (max_pages <= 1)
+ return 1;
+
+ /*
+ * This heuristic has been found to work well on both sequential and
+ * random loads, swapping to hard disk or to SSD: please don't ask
+ * what the "+ 2" means, it just happens to work well, that's all.
+ */
+ pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
+ if (pages == 2) {
+ /*
+ * We can have no readahead hits to judge by: but must not get
+ * stuck here forever, so check for an adjacent offset instead
+ * (and don't even bother to check whether swap type is same).
+ */
+ if (offset != prev_offset + 1 && offset != prev_offset - 1)
+ pages = 1;
+ prev_offset = offset;
+ } else {
+ unsigned int roundup = 4;
+ while (roundup < pages)
+ roundup <<= 1;
+ pages = roundup;
+ }
+
+ if (pages > max_pages)
+ pages = max_pages;
+ return pages;
+}
+
/**
* swapin_readahead - swap in pages in hope we need them soon
* @entry: swap entry of this memory
@@ -374,11 +415,16 @@ struct page *swapin_readahead(swp_entry_
struct vm_area_struct *vma, unsigned long addr)
{
struct page *page;
- unsigned long offset = swp_offset(entry);
+ unsigned long entry_offset = swp_offset(entry);
+ unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
- unsigned long mask = (1UL << page_cluster) - 1;
+ unsigned long mask;
struct blk_plug plug;

+ mask = swapin_nr_pages(offset) - 1;
+ if (!mask)
+ goto skip;
+
/* Read a page_cluster sized and aligned cluster around offset. */
start_offset = offset & ~mask;
end_offset = offset | mask;
@@ -392,10 +438,13 @@ struct page *swapin_readahead(swp_entry_
gfp_mask, vma, addr);
if (!page)
continue;
+ if (offset != entry_offset)
+ SetPageReadahead(page);
page_cache_release(page);
}
blk_finish_plug(&plug);

lru_add_drain(); /* Push any new pages onto the LRU now */
+skip:
return read_swap_cache_async(entry, gfp_mask, vma, addr);
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/