Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

From: Ryan Roberts
Date: Tue Nov 14 2023 - 05:58:03 EST


On 13/11/2023 15:04, Matthew Wilcox wrote:
> On Mon, Nov 13, 2023 at 10:19:48AM +0000, Ryan Roberts wrote:
>> On 13/11/2023 05:18, Matthew Wilcox wrote:
>>> My hope is to abolish the 64kB page size configuration. ie instead of
>>> using the mixture of page sizes that you currently are -- 64k and
>>> 1M (right? Order-0, and order-4)
>>
>> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
>> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
>> intuitively you would expect the order to remain constant, but it doesn't).
>>
>> The "recommend" setting above will actually enable order-3 as well even though
>> there is no HW benefit to this. So the full set of available memory sizes here is:
>>
>> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
>>
>>> , that 4k, 64k and 2MB (order-0,
>>> order-4 and order-9) will provide better performance.
>>>
>>> Have you run any experiements with a 4kB page size?
>>
>> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
>> to get to a world were we universally deal in variable sized chunks of memory,
>> aligned on 4K boundaries.
>>
>> In my experience though, there are still some performance benefits to 64K base
>> page vs 4K+contpte; the page tables are more cache efficient for the former case
>> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
>> latter. In practice the HW will still only read 8 bytes in the latter but that's
>> taking up a full cache line vs the former where a single cache line stores 8x
>> 64K entries.
>
> This is going to depend on your workload though -- if you're using more
> 2MB than 64kB, you get to elide a layer of page table with 4k base,
> rather than taking up 4 cache lines with a 64k base.

True, but again depending on workload/config, you may have few levels of lookup
for the 64K native case in the first place because you consume more VA bits at
each level.