Re: [PATCH v1 00/14] Transparent Contiguous PTEs for User Mappings

From: Barry Song
Date: Mon Jul 10 2023 - 08:05:39 EST


On Thu, Jun 22, 2023 at 11:00 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> Hi All,
>
> This is a series to opportunistically and transparently use contpte mappings
> (set the contiguous bit in ptes) for user memory when those mappings meet the
> requirements. It is part of a wider effort to improve performance of the 4K
> kernel with the aim of approaching the performance of the 16K kernel, but
> without breaking compatibility and without the associated increase in memory use. It
> also benefits the 16K and 64K kernels by enabling 2M THP, since this is the
> contpte size for those kernels.
>
> Of course this is only one half of the change. We require the mapped physical
> memory to be the correct size and alignment for this to actually be useful (i.e.
> 64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
> problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs) will
> allocate large folios up to the PMD size today, and more filesystems are coming.
> And the other half of my work, to enable the use of large folios for anonymous
> memory, aims to make contpte sized folios prevalent for anonymous memory too.
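
For readers less familiar with the contiguous bit: roughly, the requirement is
that the virtual address, the backing physical memory and the extent all line
up on a CONT_PTE_SIZE boundary (CONT_PTES consecutive entries: 64K with 4K base
pages, 2M with 16K/64K base pages). A minimal sketch of that eligibility test,
using the existing arm64 CONT_PTE_* macros (the helper name is made up, and the
caller would additionally have to guarantee that all the pages in the block
really are physically contiguous):

/* Illustration only - not code from this series. */
static bool contpte_block_eligible(unsigned long addr, pte_t pte)
{
        if (!pte_valid(pte))
                return false;

        /* The VA must start on a CONT_PTE_SIZE (e.g. 64K) boundary... */
        if (addr & ~CONT_PTE_MASK)
                return false;

        /* ...and the first pfn must be aligned the same way. */
        return !(pte_pfn(pte) & (CONT_PTES - 1));
}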
>
>
> Dependencies
> ------------
>
> While there is a complicated set of hard and soft dependencies that this patch
> set depends on, I wanted to split it out as best I could and kick off proper
> review independently.
>
> The series applies on top of these other patch sets, with a tree at:
> https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/contpte-lkml_v1
>
> v6.4-rc6
> - base
>
> set_ptes()
> - hard dependency
> - Patch set from Matthew Wilcox to set multiple ptes with a single API call
> - Allows arch backend to more optimally apply contpte mappings
> - https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@xxxxxxxxxxxxx/
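
To make the connection to contpte concrete: once the caller hands over the
whole range in one call, the arch can spot a naturally aligned CONT_PTE_SIZE
block and write every entry with the contiguous bit set. A simplified sketch of
what an arch set_ptes() could do with that information (not the actual arm64
patch; it reuses the stock arm64 pte helpers, and set_pte_at() stands in for
the arch's low-level pte writer):

/* Simplified illustration - not the implementation from this series. */
void set_ptes(struct mm_struct *mm, unsigned long addr,
              pte_t *ptep, pte_t pte, unsigned int nr)
{
        /*
         * If the write covers exactly one naturally aligned block of
         * CONT_PTES entries (16 with 4K base pages), mark it contiguous;
         * the PTE_CONT bit is then carried into every entry written below.
         */
        if (nr == CONT_PTES && !(addr & ~CONT_PTE_MASK) &&
            !(pte_pfn(pte) & (CONT_PTES - 1)))
                pte = pte_mkcont(pte);

        for (; nr; nr--, ptep++, addr += PAGE_SIZE,
             pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte)))
                set_pte_at(mm, addr, ptep, pte);
}

Without the nr argument, the arch only ever sees one pte at a time and has no
cheap way of knowing that the rest of the block is about to be written too.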
>
> ptep_get() pte encapsulation
> - hard dependency
> - Enabler series from me to ensure none of the core code ever directly
> dereferences a pte_t that lies within a live page table.
> - Enables gathering access/dirty bits from across the whole contpte range
> - in mm-stable and linux-next at time of writing
> - https://lore.kernel.org/linux-mm/d38dc237-6093-d4c5-993e-e8ffdd6cb6fa@xxxxxxx/
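
To illustrate why that encapsulation matters here (simplified sketch, not the
code from that series): once every pte read goes through ptep_get(), an arch
override can transparently fold the hardware access/dirty bits of the whole
contpte block into the value it returns, so core mm keeps seeing accurate
information:

/* Simplified illustration - not the implementation from this series. */
pte_t ptep_get(pte_t *ptep)
{
        pte_t pte = READ_ONCE(*ptep);
        int i;

        if (!pte_valid(pte) || !pte_cont(pte))
                return pte;

        /* Walk from the first entry of the naturally aligned block. */
        ptep = PTR_ALIGN_DOWN(ptep, CONT_PTES * sizeof(*ptep));

        for (i = 0; i < CONT_PTES; i++, ptep++) {
                pte_t entry = READ_ONCE(*ptep);

                /* OR together access/dirty from every entry in the block. */
                if (pte_young(entry))
                        pte = pte_mkyoung(pte);
                if (pte_dirty(entry))
                        pte = pte_mkdirty(pte);
        }

        return pte;
}

Any caller that still dereferenced the pte pointer directly would bypass this
and see stale per-entry bits, which is why the series makes ptep_get() the only
way in.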
>
> Report on physically contiguous memory in smaps
> - soft dependency
> - Enables visibility into how much memory is physically contiguous and how much
> is contpte-mapped - useful for debugging
> - https://lore.kernel.org/linux-mm/20230613160950.3554675-1-ryan.roberts@xxxxxxx/
>
> Additionally there are a couple of other dependencies:
>
> anonfolio
> - soft dependency
> - ensures more anonymous memory is allocated in contpte-sized folios, so
> needed to realize the performance improvements (this is the "other half"
> mentioned above).
> - RFC: https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@xxxxxxx/
> - Intending to post v1 shortly.
>
> exefolio
> - soft dependency
> - Tweaks readahead to ensure executable memory is mapped in 64K-sized folios,
> which is needed to see the reduction in iTLB pressure.
> - Don't intend to post this until we are further down the track with contpte
> and anonfolio.
>
> Arm ARM Clarification
> - hard dependency
> - Current wording disallows the fork() optimization in the final patch.
> - Arm (ATG) have proposed tightening the wording to permit it.
> - In conversation with partners to check this wouldn't cause problems for any
> existing HW deployments.
>
> All of the _hard_ dependencies need to be resolved before this can be considered
> for merging.
>
>
> Performance
> -----------
>
> The results below show 2 benchmarks: kernel compilation and Speedometer 2.0 (a
> JavaScript benchmark running in Chromium). Both cases run on an Ampere
> Altra with 1 NUMA node enabled, Ubuntu 22.04 and an XFS filesystem. Each benchmark
> is repeated 15 times over 5 reboots and averaged.
>
> All improvements are relative to baseline-4k. anonfolio and exefolio are as
> described above. contpte is this series. (Note that exefolio only gives an
> improvement because contpte is already in place).
>
> Kernel Compilation (smaller is better):
>
> | kernel | real-time | kern-time | user-time |
> |:-------------|------------:|------------:|------------:|
> | baseline-4k | 0.0% | 0.0% | 0.0% |
> | anonfolio | -5.4% | -46.0% | -0.3% |
> | contpte | -6.8% | -45.7% | -2.1% |
> | exefolio | -8.4% | -46.4% | -3.7% |

Sorry, I am a bit confused: in the exefolio case, is anonfolio included,
or does it only have large cont-pte folios for the exe code? In other words,
does the 8.4% improvement come from iTLB miss reduction only,
or from both dTLB and iTLB miss reduction?

> | baseline-16k | -8.7% | -49.2% | -3.7% |
> | baseline-64k | -10.5% | -66.0% | -3.5% |
>
> Speedometer 2.0 (bigger is better):
>
> | kernel | runs_per_min |
> |:-------------|---------------:|
> | baseline-4k | 0.0% |
> | anonfolio | 1.2% |
> | contpte | 3.1% |
> | exefolio | 4.2% |

Same question as above.

> | baseline-16k | 5.3% |
>
> I've also run Speedometer 2.0 on Pixel 6 with an Ubuntu SW stack and see similar
> gains.
>
> I've also verified that running the contpte changes without anonfolio and
> exefolio does not cause any regression vs baseline-4k.
>
>
> Opens
> -----
>
> The only potential issue that I see right now is that due to there only being 1
> access/dirty bit per contpte range, if a single page in the range is
> accessed/dirtied then all the adjacent pages are reported as accessed/dirtied
> too. Access/dirty is managed by the kernel per _folio_, so this information gets
> collapsed down anyway, and nothing changes there. However, the per _page_
> access/dirty information is reported through pagemap to user space. I'm not sure
> whether this would/should be considered a break. Thoughts?
>
> Thanks,
> Ryan

Thanks
Barry