Re: [linus:master] [readahead] ab4443fe3c: vm-scalability.throughput -21.4% regression

From: Matthew Wilcox
Date: Thu Mar 07 2024 - 13:20:05 EST

Next message: Zi Yan: "[PATCH 1/2] mm/huge_memory: check new folio order when split a folio"
Previous message: Andrea Parri: "Re: Question about PB rule of LKMM"
In reply to: Jan Kara: "Re: [linus:master] [readahead] ab4443fe3c: vm-scalability.throughput -21.4% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, Mar 07, 2024 at 10:23:08AM +0100, Jan Kara wrote:
> Thanks for testing! This is an interesting result and certainly unexpected
> for me. The readahead code allocates naturally aligned pages so based on
> the distribution of allocations it seems that before commit ab4443fe3ca6
> readahead window was at least 32 pages (128KB) aligned and so we allocated
> order 5 pages. After the commit, the readahead window somehow ended up only
> aligned to 20 modulo 32. To follow natural alignment and fill 128KB
> readahead window we allocated order 2 page (got us to offset 24 modulo 32),
> then order 3 page (got us to offset 0 modulo 32), order 4 page (larger
> would not fit in 128KB readahead window now), and order 2 page to finish
> filling the readahead window.
>
> Now I'm not 100% sure why the readahead window alignment changed with
> different rounding when placing readahead mark - probably that's some
> artifact when readahead window is tiny in the beginning before we scale it
> up (I'll verify by tracing whether everything ends up looking correctly
> with the current code). So I don't expect this is a problem in ab4443fe3ca6
> as such but it exposes the issue that readahead page insertion code should
> perhaps strive to achieve better readahead window alignment with logical
> file offset even at the cost of occasionally performing somewhat shorter
> readahead. I'll look into this once I dig out of the huge heap of email
> after vacation...

I was surprised by what you said here, so I went and re-read the code
and it doesn't work the way I thought it did. So I had a good long think
about how it _should_ work, and I looked for some more corner conditions,
and this is what I came up with.

The first thing I've done is separate out the two limits. The EOF is
a hard limit; we will not allocate pages beyond EOF. The ra->size is
a soft limit; we will allocate pages beyond ra->size, but not too far.

The second thing I noticed is that index + ra_size could wrap. So add
a check for that, and set it to ULONG_MAX. index + ra_size - async_size
could also wrap, but this is harmless. We certainly don't want to kick
off any more readahead in this circumstance, so leaving 'mark' outside
the range [index..ULONG_MAX] is just fine.

The third thing is that we could allocate a folio which contains a page
at ULONG_MAX. We don't really want that in the page cache; it makes
filesystems more complicated if they have to check for that, and we
don't allow an order-0 folio at ULONG_MAX, so there's no need for it.
This _should_ already be prohibited by the "Don't allocate pages past EOF"
check, but let's explicitly prohibit it.

Compile tested only.

diff --git a/mm/readahead.c b/mm/readahead.c
index 130c0e7df99f..742e1f39035b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -488,7 +488,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
{
struct address_space *mapping = ractl->mapping;
pgoff_t index = readahead_index(ractl);
- pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+ pgoff_t last = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+ pgoff_t limit = index + ra->size;
pgoff_t mark = index + ra->size - ra->async_size;
int err = 0;
gfp_t gfp = readahead_gfp_mask(mapping);
@@ -496,23 +497,26 @@ void page_cache_ra_order(struct readahead_control *ractl,
if (!mapping_large_folio_support(mapping) || ra->size < 4)
goto fallback;

- limit = min(limit, index + ra->size - 1);
-
if (new_order < MAX_PAGECACHE_ORDER) {
new_order += 2;
new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
new_order = min_t(unsigned int, new_order, ilog2(ra->size));
}

+ if (limit < index)
+ limit = ULONG_MAX;
filemap_invalidate_lock_shared(mapping);
- while (index <= limit) {
+ while (index < limit) {
unsigned int order = new_order;

/* Align with smaller pages if needed */
if (index & ((1UL << order) - 1))
order = __ffs(index);
+ /* Avoid wrap */
+ if (index + (1UL << order) == 0)
+ order--;
/* Don't allocate pages past EOF */
- while (index + (1UL << order) - 1 > limit)
+ while (index + (1UL << order) - 1 > last)
order--;
err = ra_alloc_folio(ractl, index, mark, order, gfp);
if (err)

Next message: Zi Yan: "[PATCH 1/2] mm/huge_memory: check new folio order when split a folio"
Previous message: Andrea Parri: "Re: Question about PB rule of LKMM"
In reply to: Jan Kara: "Re: [linus:master] [readahead] ab4443fe3c: vm-scalability.throughput -21.4% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]