Re: [RFC PATCH v1] mm/filemap: Allow arch to request folio size for exec memory

From: Ryan Roberts
Date: Mon Jan 15 2024 - 04:33:52 EST


On 13/01/2024 00:11, Barry Song wrote:
> On Sat, Jan 13, 2024 at 12:04 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>>
>> On Sat, Jan 13, 2024 at 11:54:23AM +1300, Barry Song wrote:
>>>>> Perhaps an alternative would be to double ra->size and set ra->async_size to
>>>>> (ra->size / 2)? That would ensure we always have 64K aligned blocks but would
>>>>> give us an async portion so readahead can still happen.
>>>>
>>>> this might be worth trying, as PMD is doing exactly this, because async
>>>> readahead can decrease the latency of subsequent page faults.
>>>>
>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>         /* Use the readahead code, even if readahead is disabled */
>>>>         if (vm_flags & VM_HUGEPAGE) {
>>>>                 fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>>>>                 ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
>>>>                 ra->size = HPAGE_PMD_NR;
>>>>                 /*
>>>>                  * Fetch two PMD folios, so we get the chance to actually
>>>>                  * readahead, unless we've been told not to.
>>>>                  */
>>>>                 if (!(vm_flags & VM_RAND_READ))
>>>>                         ra->size *= 2;
>>>>                 ra->async_size = HPAGE_PMD_NR;
>>>>                 page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
>>>>                 return fpin;
>>>>         }
>>>> #endif
>>>>
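
(For concreteness, the non-PMD variant I was suggesting would look roughly like
the below. This is only a sketch: it assumes 4K base pages so that a 64K block
is order-4, and "exec_order" is just an illustrative name, not something from
the patch.)

        /* Sketch only: 64K-aligned readahead with an async half for exec VMAs. */
        const unsigned int exec_order = 4;              /* 64K with 4K base pages */
        const unsigned long nr = 1UL << exec_order;

        fpin = maybe_unlock_mmap_for_io(vmf, fpin);
        ractl._index &= ~(nr - 1);                      /* align down to a 64K block */
        ra->size = 2 * nr;                              /* this 64K block plus the next */
        ra->async_size = nr;                            /* treat the second block as async readahead */
        page_cache_ra_order(&ractl, ra, exec_order);
        return fpin;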
>>>
>>> BTW, rather than simply always reading backwards, we did something very
>>> "ugly" to simulate "read-around" for CONT-PTE exec before[1]
>>>
>>> If a page fault happens in the first half of a cont-pte block, we read that
>>> 64KiB and its previous 64KiB; otherwise, we read it and its next 64KiB.

I actually tried something very similar to this while prototyping. I found that
it was about 10% less effective at getting text into 64K folios than the
approach I posted. I didn't investigate why, as I came to the conclusion that
text is unlikely to benefit from readahead anyway.
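
Relative to the sketch further up, the read-around variant just changes how the
start index is chosen and drops the async half; roughly (again, an illustrative
sketch rather than the exact code I tried):

        /* Sketch only: fault in the first half of a 64K block => read the
         * previous block plus this one; second half => this one plus the next.
         */
        ractl._index &= ~(nr - 1);                      /* align down to a 64K block */
        if (ractl._index >= nr && vmf->pgoff - ractl._index < nr / 2)
                ractl._index -= nr;                     /* first half: start one block earlier */
        ra->size = 2 * nr;                              /* always two adjacent 64K blocks */
        ra->async_size = 0;
        page_cache_ra_order(&ractl, ra, exec_order);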

>>
>> I don't think that makes sense. The CPU executes instructions forwards,
>> not "around". I honestly think we should treat text as "random access"
>> because function A calls function B and functions A and B might well be
>> very far apart from each other. The only time I see you actually
>> getting "readahead" hits is if a function is split across two pages (for
>> whatever size of page), but that's a false hit! The function is not,
>> generally, 64kB long, so doing readahead is no more likely to bring in
>> the next page of text that we want than reading any other random page.
>>
>
> it seems you are in favor of Ryan's modification even for filesystems
> which don't support large mapping?
>
>> Unless somebody finds the GNU Rope source code from 1998, or recreates it:
>> https://lwn.net/1998/1029/als/rope.html
>> Then we might actually have some locality.
>>
>> Did you actually benchmark what you did? Is there really some locality
>> between the code at offset 256-288kB in the file and then in the range
>> 192kB-256kB?
>
> I didn't really have benchmark data; at that point I just instinctively
> didn't want to break the logic of read-around, so I made the code do that.
> The info you provided makes me re-think whether the read-around code is
> necessary, thanks!

As a quick experiment, I modified my thpmaps script to collect data *only* for
executable mappings. This is run *without* my change:

| File-backed exec folios  | Speedometer | Kernel Compile |
|==========================|=============|================|
| file-thp-aligned-16kB    |         56% |            46% |
| file-thp-aligned-32kB    |          2% |             3% |
| file-thp-aligned-64kB    |          4% |             5% |
| file-thp-unaligned-16kB  |          0% |             3% |
| file-thp-unaligned-128kB |          2% |             0% |
| file-thp-partial         |          0% |             0% |

It's implied that the rest of the memory (up to 100%) is small (single page)
folios. I think the only reason we would see small folios is if we would
otherwise run off the end of the file?

If so, then I think any text in folios > 16K is a rough proxy for how effective
readahead is for text: Not very.

Intuitively, I agree with Matthew that readahead doesn't make much sense for
text, and this rough data seems to agree.


>
> I was using filesystems without large-mapping support but worked around
> the problem by:
> 1. preparing 16*n normal pages
> 2. inserting the normal pages into the xarray
> 3. letting the filesystem read the 16 normal pages
> 4. after all IO has completed, transforming the 16 pages into an mTHP and
> reinserting the mTHP into the xarray

I had a go at something like this too, but I was doing it in the dynamic loader
and having it do MADV_COLLAPSE to generate PMD-sized THPs for the text. I
actually found this to be even faster for the use cases I was measuring. But of
course it uses more memory due to the 2M page size, and I expect it slows down
app load time because it potentially reads in a lot more text than is actually
faulted. Ultimately I think the better strategy is to make the filesystems
large-folio capable.
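
To give a flavour of that experiment, a minimal standalone approximation is
below. It is not the actual loader change: the /proc/self/maps parsing is
simplified, MADV_COLLAPSE needs a 6.1+ kernel with THP enabled, and the kernel
may simply decline to collapse a given range.

/* collapse-text.c: best-effort MADV_COLLAPSE on our own executable
 * file-backed mappings, roughly what the loader experiment did.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

int main(void)
{
	FILE *maps = fopen("/proc/self/maps", "r");
	char line[512];

	if (!maps)
		return 1;

	while (fgets(line, sizeof(line), maps)) {
		unsigned long start, end;
		char perms[5], path[256] = "";

		/* "start-end perms offset dev inode pathname" */
		if (sscanf(line, "%lx-%lx %4s %*s %*s %*s %255s",
			   &start, &end, perms, path) < 3)
			continue;

		/* Only file-backed executable mappings. */
		if (strcmp(perms, "r-xp") != 0 || path[0] != '/')
			continue;

		if (madvise((void *)start, end - start, MADV_COLLAPSE))
			perror("madvise(MADV_COLLAPSE)");
	}

	fclose(maps);
	return 0;
}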

>
> that was very painful and in the end made no improvement, probably due to
> various sync overheads, so I gave up and didn't dig further into the data.
>
> Thanks
> Barry