Re: [PATCH RFC] mm readahead: Fix the readahead fail in case of empty numa node

From: Jan Kara
Date: Thu Dec 12 2013 - 06:14:38 EST


On Wed 11-12-13 15:05:22, Andrew Morton wrote:
> On Wed, 11 Dec 2013 23:49:17 +0100 Jan Kara <jack@xxxxxxx> wrote:
>
> > > /*
> > > - * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
> > > - * sensible upper limit.
> > > + * max_sane_readahead() is disabled. It can later be removed altogether, but
> > > + * let's keep a skeleton in place for now, in case disabling was the wrong call.
> > > */
> > > unsigned long max_sane_readahead(unsigned long nr)
> > > {
> > > - return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
> > > - + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> > > + return nr;
> > > }
> > >
> > > /*
> > >
> > > Can anyone see a problem with this?
> > Well, the downside seems to be that if userspace previously issued
> > MADV/FADV_WILLNEED on a huge file, we trimmed the request to a sensible
> > size. Now we try to read the whole huge file which is pretty much
> > guaranteed to be useless (as we'll be pushing out of cache data we just
> > read a while ago). And guessing the right readahead size from userspace
> > isn't trivial so it would make WILLNEED advice less useful. What do you
> > think?
>
> OK, yes, there is conceivably a back-compatibility issue there. There
> indeed might be applications which decide to chuck the whole thing at
> the kernel and let the kernel work out what is a sensible readahead
> size to perform.
>
> But I'm really struggling to think up an implementation! The current
> code looks only at the caller's node and doesn't seem to make much
> sense. Should we look at all nodes? Hard to say without prior
> knowledge of where those pages will be coming from.
Well, I believe we might have compatibility issues only on non-NUMA
machines - there the current logic makes sense. On NUMA machines I believe
we are free to do basically anything, because the results of the current
logic are pretty random.

Thinking about a proper implementation for NUMA - max_sane_readahead() is
really interesting for madvise() and fadvise() calls (standard on-demand
readahead is bounded by bdi->ra_pages, which tends to be pretty low anyway,
like 512K or so). For these calls we will do the reads from the process
issuing the [fm]advise() call and thus allocate pages according to its NUMA
policy. So based on that policy we should be able to pick some estimate of
the number of available pages, shouldn't we?

BTW, the fact that [fm]advise() calls submit all reads synchronously is
another reason why we should bound the readahead requests to a sensible
size.

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR