Re: [PATCH v1 0/3] Speed up boot with faster linear map creation

From: Ryan Roberts
Date: Wed Mar 27 2024 - 06:43:28 EST


On 27/03/2024 10:09, Ard Biesheuvel wrote:
> Hi Ryan,
>
> On Tue, 26 Mar 2024 at 12:15, Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>
>> Hi All,
>>
>> It turns out that creating the linear map can take a significant proportion of
>> the total boot time, especially when rodata=full. And a large portion of the
>> time it takes to create the linear map is issuing TLBIs. This series reworks the
>> kernel pgtable generation code to significantly reduce the number of TLBIs. See
>> each patch for details.
>>
>> The table below shows the execution time of map_mem() across a couple of
>> different systems with different RAM configurations. We measure after applying
>> each patch and show the improvement relative to base (v6.9-rc1):
>>
>>                |  Apple M2 VM| Ampere Altra| Ampere Altra| Ampere Altra
>>                |      VM, 16G|      VM, 64G|     VM, 256G|  Metal, 512G
>> ---------------|-------------|-------------|-------------|-------------
>>                |    ms    (%)|    ms    (%)|    ms    (%)|    ms    (%)
>> ---------------|-------------|-------------|-------------|-------------
>> base           |   151   (0%)|  2191   (0%)|  8990   (0%)| 17443   (0%)
>> no-cont-remap  |    77 (-49%)|   429 (-80%)|  1753 (-80%)|  3796 (-78%)
>> no-alloc-remap |    77 (-49%)|   375 (-83%)|  1532 (-83%)|  3366 (-81%)
>> lazy-unmap     |    63 (-58%)|   330 (-85%)|  1312 (-85%)|  2929 (-83%)
>>
>> This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
>> tested all VA size configs (although I don't anticipate any issues); I'll do
>> this as part of a follow-up.
>>
>
> These are very nice results!
>
> Before digging into the details: do we still have a strong case for
> supporting contiguous PTEs and PMDs in these routines?

IIRC, we are currently using contptes and pmds for the linear map when
rodata=[on|off], so personally I don't see a need to remove the capability.

Also, I was talking with Mark R yesterday and he suggested that an even better
solution might be to create a temporary pgtable that maps the linear map with
pmds, switch to it, then create the real pgtable that maps the linear map with
ptes, then switch to that. The benefit is that we can avoid the fixmap entirely
when creating the second pgtable; we think this would likely be significantly
faster still.

My second patch adds the infrastructure to make this possible. But your LPA2
changes make it significantly more effort: since that change, we populate the
linear map directly into the swapper pgtable, in which the kernel is already
mapped (that mapping is no longer done in paging_init()). So I'm not quite sure
how we can easily make that work at the moment.

Thanks,
Ryan