Re: [RFC PATCH v2 1/6] KVM: gmem: Truncate pages on punch hole

From: Michael Roth
Date: Thu Oct 05 2023 - 13:53:28 EST


On Thu, Sep 21, 2023 at 02:34:46PM -0700, Sean Christopherson wrote:
> On Thu, Sep 21, 2023, Sean Christopherson wrote:
> > On Thu, Sep 21, 2023, isaku.yamahata@xxxxxxxxx wrote:
> > > From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > >
> > > Although kvm_gmem_punch_hole() keeps all pages in the mapping when punching
> > > a hole, the common expectation is that the pages are truncated. Truncate
> > > pages on punch hole. As page contents can be encrypted, avoid zeroing a
> > > partial folio by refusing partial punch holes.
> > >
> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > > ---
> > > virt/kvm/guest_mem.c | 14 ++++++++++++--
> > > 1 file changed, 12 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> > > index a819367434e9..01fb4ca861d0 100644
> > > --- a/virt/kvm/guest_mem.c
> > > +++ b/virt/kvm/guest_mem.c
> > > @@ -130,22 +130,32 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
> > > static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> > > {
> > > struct list_head *gmem_list = &inode->i_mapping->private_list;
> > > + struct address_space *mapping = inode->i_mapping;
> > > pgoff_t start = offset >> PAGE_SHIFT;
> > > pgoff_t end = (offset + len) >> PAGE_SHIFT;
> > > struct kvm_gmem *gmem;
> > >
> > > + /*
> > > + * punch hole may result in zeroing partial area. As pages can be
> > > + * encrypted, prohibit zeroing partial area.
> > > + */
> > > + if (offset & ~PAGE_MASK || len & ~PAGE_MASK)
> > > + return -EINVAL;
> >
> > This should be unnecessary, kvm_gmem_fallocate() does
> >
> > if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > return -EINVAL;
> >
> > before invoking kvm_gmem_punch_hole(). If that's not working, i.e. your test
> > fails, then that code needs to be fixed. I'll run your test to double-check,
> > but AFAICT this is unnecessary.
>
> I confirmed that the testcase passes without the extra checks. Just to close the
> loop, what prompted adding more checks to kvm_gmem_punch_hole()?

I don't know if it's the same issue that Isaku ran into, but for SNP we
hit a similar issue with the truncate_inode_pages_range(lstart, lend) call.

The issue in that case was a bit more subtle:

- userspace does a hole-punch on a 4K range of its gmem FD, which happens
to be backed by a 2MB folio.
- truncate_inode_pages_range() gets called for that 4K range
- truncate_inode_pages_range() does special handling for the folios at the
start/end of the range in case they are only partially covered, and passes
those to truncate_inode_partial_folio(folio, lstart, lend). In this case
there's just the one backing folio, but it *still* gets the special
treatment, and so gets passed to truncate_inode_partial_folio().
- truncate_inode_partial_folio() will then zero that 4K range, even though
it is page-aligned, based on the following rationale in the comments:

/*
* We may be zeroing pages we're about to discard, but it avoids
* doing a complex calculation here, and then doing the zeroing
* anyway if the page split fails.
*/
folio_zero_range(folio, offset, length);

- after that, the .invalidate_folio callback is issued and the folio is split,
and then the caller (truncate_inode_pages_range()) does another pass over the
whole range and frees the now-split folios, at which point the .free_folio
callbacks are issued (see the condensed sketch below).
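
To make the ordering concrete, the partial-folio path looks roughly like this
(condensed from truncate_inode_partial_folio() in mm/truncate.c, not a
verbatim quote):

  bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
  {
          ...
          folio_wait_writeback(folio);
          if (length == folio_size(folio)) {
                  /* range covers the whole folio: truncate, no zeroing */
                  truncate_inode_folio(folio->mapping, folio);
                  return true;
          }

          /* partially-covered folio: the punched range is zeroed first... */
          folio_zero_range(folio, offset, length);

          /* ...and only then do .invalidate_folio and the split happen */
          if (folio_has_private(folio))
                  folio_invalidate(folio, offset, length);
          if (!folio_test_large(folio))
                  return true;
          if (split_folio(folio) == 0)
                  return true;
          ...
  }

So for a 4K hole punched in a 2MB folio, length != folio_size(folio) and the
folio_zero_range() always runs, page-aligned or not.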

Because of that, we can't rely on .invalidate_folio/.free_folio to handle
putting the pages back into a normal host-accessible state, because the
zeroing happens before either of them runs. That's why we ended up needing
the following for the SNP patches, to make sure the arch-specific
invalidation callbacks are issued before the truncation occurs:

https://github.com/mdroth/linux/commit/4ebcc04b84dd691fc6daccb9b7438402520b0704#diff-77306411fdaeb7f322a1ca756dead9feb75363aa6117b703ac118576153ddb37R233
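
For reference, the shape of that change is roughly the following (the arch
hook name here is hypothetical; see the commit above for the actual details):

  static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
  {
          struct list_head *gmem_list = &inode->i_mapping->private_list;
          pgoff_t start = offset >> PAGE_SHIFT;
          pgoff_t end = (offset + len) >> PAGE_SHIFT;
          struct kvm_gmem *gmem;

          list_for_each_entry(gmem, gmem_list, entry)
                  kvm_gmem_invalidate_begin(gmem, start, end);

          /*
           * Issue the arch-specific invalidations (e.g. RMP cleanup for
           * SNP) *before* truncation, so the pages are host-accessible
           * again by the time truncate_inode_partial_folio() might zero
           * them.
           */
          kvm_gmem_issue_arch_invalidate(inode, start, end); /* hypothetical */

          truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);

          list_for_each_entry(gmem, gmem_list, entry)
                  kvm_gmem_invalidate_end(gmem, start, end);

          return 0;
  }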

I'd planned to post those as a separate RFC for discussion, but when I came
across this it seemed like it might be relevant to what the TDX folks may
have run into here.

If not for the zeroing logic mentioned above, .free_folio() ends up working
pretty nicely for SNP, for both truncation and fput(). It even plays nicely
with the live update use-case, where the destination gmem instance shares the
inode->i_mapping: since iput() won't trigger truncate_inode_pages_final()
until the last reference goes away, we don't have to do anything special in
kvm_gmem_release() to determine when we should/shouldn't issue the
arch-invalidations that clean up things like the RMP table.
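
i.e. roughly something like this (a sketch, with kvm_arch_gmem_invalidate()
standing in for whatever the arch hook ends up being named):

  static void kvm_gmem_free_folio(struct folio *folio)
  {
          kvm_pfn_t pfn = folio_pfn(folio);

          /*
           * Last put of the folio: return the pages to a host-accessible
           * state (e.g. reclaim them from the RMP table for SNP) before
           * they go back to the page allocator.
           */
          kvm_arch_gmem_invalidate(pfn, pfn + folio_nr_pages(folio));
  }

  static const struct address_space_operations kvm_gmem_aops = {
          ...
          .free_folio = kvm_gmem_free_folio,
  };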

It seems like the above zeroing logic could be reworked to only zero
non-page-aligned ranges (as the comments above truncate_inode_pages_range()
claim should be the case), which would avoid the issue for the gmem use-case.
But I wonder if some explicit "dont-zero-these-pages" flag might be more
robust.
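
A minimal sketch of that idea, assuming a new AS_* mapping flag (the names
AS_INACCESSIBLE/mapping_inaccessible() are made up here, and gmem would need
to set the flag on the inode's mapping at creation time):

  /* include/linux/pagemap.h */
  static inline bool mapping_inaccessible(struct address_space *mapping)
  {
          return test_bit(AS_INACCESSIBLE, &mapping->flags);
  }

  /* mm/truncate.c, in truncate_inode_partial_folio() */
          /*
           * We may be zeroing pages we're about to discard, but skip
           * mappings that have opted out, e.g. because their contents
           * aren't host-accessible.
           */
          if (!mapping_inaccessible(folio->mapping))
                  folio_zero_range(folio, offset, length);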

Or maybe there's some other way we should be going about this?