Re: [PATCH v1 11/26] x86/sev: Invalidate pages from the direct map when adding them to the RMP table

From: Michael Roth
Date: Tue Jan 16 2024 - 11:52:26 EST


On Tue, Jan 16, 2024 at 10:19:09AM -0600, Michael Roth wrote:
> I did some performance tests which do seem to indicate that
> pre-splitting the directmap to 4K can substantially improve certain
> SNP guest workloads. This test involves running a single 1TB SNP guest
> with 128 vCPUs running "stress --vm 128 --vm-bytes 5G --vm-keep" to
> rapidly fault in all of its memory via lazy acceptance, and then
> measuring the rate that gmem pages are being allocated on the host by
> monitoring "FileHugePages" from /proc/meminfo to get some rough gauge
> of how quickly a guest can fault in its initial working set prior to
> reaching steady state. The data is a bit noisy but seems to indicate
> significant improvement by taking the directmap updates out of the
> lazy acceptance path, and I would only expect that to become more
> significant as you scale up the number of guests / vCPUs.
>
> # Average fault-in rate across 3 runs, measured in GB/s
>                     unpinned | pinned to NUMA node 0
> DirectMap4K             12.9 | 12.1
>   stddev                 2.2 |  1.3
> DirectMap2M+split        8.0 |  8.9
>   stddev                 1.3 |  0.8
>
> The downside of course is potential impact for non-SNP workloads
> resulting from splitting the directmap. Mike Rapoport's numbers make
> me feel a little better about it, but I don't think they apply directly
> to the notion of splitting the entire directmap. Even the LWN article
> summarizes:
>
> "The conclusion from all of this, Rapoport continued, was that
> direct-map fragmentation just does not matter — for data access, at
> least. Using huge-page mappings does still appear to make a difference
> for memory containing the kernel code, so allocator changes should
> focus on code allocations — improving the layout of allocations for
> loadable modules, for example, or allowing vmalloc() to allocate huge
> pages for code. But, for kernel-data allocations, direct-map
> fragmentation simply appears to not be worth worrying about."
>
> So at the very least, if we went down this path, it would be worth
> investigating the following areas in addition to general perf testing:
>
> 1) Only splitting directmap regions corresponding to kernel-allocatable
> *data* (hopefully that's even feasible...)
> 2) Potentially deferring the split until an SNP guest is actually
> run, so there isn't any impact just from having SNP enabled (though
> you still take a hit from RMP checks in that case so maybe it's not
> worthwhile, but that itself has been noted as a concern for users
> so it would be nice to not make things even worse).
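
The FileHugePages sampling mentioned above boils down to something like the
following (an illustrative sketch, not the exact harness; the function name
and interval are made up):

```shell
# Illustrative sketch: derive a rough fault-in rate in GB/s from two
# FileHugePages samples (reported in kB by /proc/meminfo) taken
# $interval seconds apart.
rate_gbs() {
    local kb_before=$1 kb_after=$2 interval=$3
    # kB -> GB is two divisions by 1024
    echo $(( (kb_after - kb_before) / interval / 1024 / 1024 ))
}

# Sampling loop (needs a running SNP guest backed by gmem):
# while sleep 1; do awk '/FileHugePages/ {print $2}' /proc/meminfo; done
```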

There's another potential area of investigation I forgot to mention that
doesn't involve pre-splitting the directmap. It makes use of the fact
that the kernel should never be accessing a 2MB mapping that overlaps with
private guest memory if the backing PFN for the guest memory is a 2MB page.
Since there's no chance for overlap (well, maybe via a 1GB directmap entry,
but it's a far less dramatic change to force those down to 2MB), there's no need to
actually split the directmap entry in these cases since they won't
result in unexpected RMP faults.
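
A rough sketch of that decision (purely illustrative; the helper name is
made up and this is not actual kernel code, just the alignment check the
idea relies on):

```c
#include <stdbool.h>
#include <stdint.h>

#define PMD_SHIFT 21
#define PMD_SIZE  ((uint64_t)1 << PMD_SHIFT)   /* 2MB */
#define PMD_MASK  (~(PMD_SIZE - 1))

/*
 * Hypothetical helper: decide whether the directmap PMD covering
 * [paddr, paddr + size) must be split before the range is added to
 * the RMP table as private guest memory. If the private range owns
 * the entire 2MB region backing the PMD, the kernel has no reason to
 * ever touch that mapping, so no overlap (and no unexpected RMP
 * fault) is possible and the split can be skipped.
 */
static bool needs_directmap_split(uint64_t paddr, uint64_t size)
{
    /* A 2MB-aligned, 2MB-sized private range fully owns the PMD. */
    if ((paddr & ~PMD_MASK) == 0 && size == PMD_SIZE)
        return false;

    /*
     * A sub-2MB range shares the PMD with other (possibly shared)
     * memory, so the 2MB mapping must be demoted to 4K entries.
     */
    return true;
}
```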

So if pre-splitting the directmap ends up having too many downsides, then
there may still be some potential for optimizing the current approach to a
fair degree.

-Mike