Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

From: Christian Brauner
Date: Thu Apr 20 2023 - 04:35:49 EST


On Wed, Apr 19, 2023 at 05:49:55PM -0700, Sean Christopherson wrote:
> On Wed, Apr 19, 2023, Christian Brauner wrote:
> > On Thu, Apr 13, 2023 at 03:28:43PM -0700, Sean Christopherson wrote:
> > > > But if you want to preserve the inode number and device number of the
> > > > relevant tmpfs instance but still report memfd restricted as your
> > > > filesystem type
> > >
> > > Unless I missed something along the way, reporting memfd_restricted as a distinct
> > > filesystem is very much a non-goal. AFAIK it's purely a side effect of the
> > > proposed implementation.
> >
> > In the current implementation you would have to put in effort to fake
> > this. For example, you would need to also implement ->statfs
> > super_operation where you'd need to fill in the details of the tmpfs
> > instance. At that point all that memfd_restricted fs code that you've
> > written is nothing but deadweight, I would reckon.
>
> After digging a bit, I suspect the main reason Kirill implemented an overlay to
> inode_operations was to prevent modifying the file size via ->setattr(). Relying
> on shmem_setattr() to unmap entries in KVM's MMU wouldn't work because, by design,
> the memory can't be mmap()'d into host userspace.
>
> if (attr->ia_valid & ATTR_SIZE) {
> if (memfd->f_inode->i_size)
> return -EPERM;
>
> if (!PAGE_ALIGNED(attr->ia_size))
> return -EINVAL;
> }
>
> But I think we can solve this particular problem by using F_SEAL_{GROW,SHRINK} or
> SHMEM_LONGPIN. For a variety of reasons, I'm leaning more and more toward making
> this a KVM ioctl() instead of a dedicated syscall, at which point we can be both
> more flexible and more draconian, e.g. let userspace provide the file size at the
> time of creation, but make the size immutable, at least by default.
>
> > > After giving myself a bit of a crash course in file systems, would something like
> > > the below have any chance of (a) working, (b) getting merged, and (c) being
> > > maintainable?
> > >
> > > The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem
> > > hijacks a f_ops and a_ops to create a lightweight shim around tmpfs. There are
> > > undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might
> > > be doable" or a "no, that's absolutely bonkers, don't try it".
> >
> > Maybe, but I think it's weird.
>
> Yeah, agreed.
>
> > _Replacing_ f_ops isn't something that's unprecedented. It happens everytime
> > a character device is opened (see fs/char_dev.c:chrdev_open()). And debugfs
> > does a similar (much more involved) thing where it replaces it's proxy f_ops
> > with the relevant subsystem's f_ops. The difference is that in both cases the
> > replace happens at ->open() time; and the replace is done once. Afterwards
> > only the newly added f_ops are relevant.
> >
> > In your case you'd be keeping two sets of {f,a}_ops; one usable by
> > userspace and another only usable by in-kernel consumers. And there are
> > some concerns (non-exhaustive list), I think:
> >
> > * {f,a}_ops weren't designed for this. IOW, one set of {f,a}_ops is
> > authoritative per @file and it is left to the individual subsystems to
> > maintain driver specific ops (see the sunrpc stuff or sockets).
> > * lifetime management for the two sets of {f,a}_ops: If the ops belong
> > to a module then you need to make sure that the module can't get
> > unloaded while you're using the fops. Might not be a concern in this
> > case.
>
> Ah, whereas I assume the owner of inode_operations is pinned by ??? (dentry?)
> holding a reference to the inode?

I don't think it would be possible to safely replace inode_operations
after the inode's been made visible in caches.

It works with file_operations because when a file is opened a new struct
file is allocated which isn't reachable anywhere before fd_install() is
called. So it is possible to replace f_ops in the default
f->f_op->open() method (which is what devices do as the inode is located
on e.g., ext4/xfs/tmpfs but the functionality of the device usually
provided by some driver/module through its file_operations). The default
f_ops are taken from i_fop of the inode.

The lifetime of the file_/inode_operations will be aligned with the
lifetime of the module they're originating from. If only
file_/inode_operations are used from within the same module then there
should never be any lifetime concerns.

So an inode doesn't explictly pin file_/inode_operations because there's
usually no need to do that and it be weird if each new inode would take
a reference on the f_ops/i_ops on the off-chance that someone _might_
open the file. Let alone the overhead of calling try_module_get()
everytime a new inode is added to the cache. There are various fs
objects - the superblock which is pinning the filesystem/module - that
exceed the lifetime of inodes and dentries. Both also may be dropped
from their respective caches and readded later.

Pinning of the module for f_ops is done because it is possible that some
filesystem/driver might want to use the file_operations of some other
filesystem/driver by default and they are in separate modules. So the
fops_get() in do_dentry_open is there because it's not guaranteed that
file_/inode_operations originate from the same module as the inode
that's opened. If the module is still alive during the open then a
reference to its f_ops is taken if not then the open will fail with
ENODEV.

That's to the best of my knowledge.

>
> > * brittleness: Not all f_ops for example deal with userspace
> > functionality some deal with cleanup when the file is closed like
> > ->release(). So it's delicate to override that functionality with
> > custom f_ops. Restricted memfds could easily forget to cleanup
> > resources.
> > * Potential for confusion why there's two sets of {f,a}_ops.
> > * f_ops specifically are generic across a vast amount of consumers and
> > are subject to change. If memfd_restricted() has specific requirements
> > because of this weird double-use they won't be taken into account.
> >
> > I find this hard to navigate tbh and it feels like taking a shortcut to
> > avoid building a proper api.
>
> Agreed. At the very least, it would be better to take an explicit dependency on
> whatever APIs are being used instead of somewhat blindly bouncing through ->fallocate().
> I think that gives us a clearer path to getting something merged too, as we'll
> need Acks on making specific functions visible, i.e. will give MM maintainers
> something concrete to react too.
>
> > If you only care about a specific set of operations specific to memfd
> > restricte that needs to be available to in-kernel consumers, I wonder if you
> > shouldn't just go one step further then your proposal below and build a
> > dedicated minimal ops api.
>
> This is actually very doable for shmem. Unless I'm missing something, because
> our use case doesn't allow mmap(), swap, or migration, a good chunk of
> shmem_fallocate() is simply irrelevant. The result is only ~100 lines of code,
> and quite straightforward.
>
> My biggest concern, outside of missing a detail in shmem, is adding support for
> HugeTLBFS, which is likely going to be requested/needed sooner than later. At a
> glance, hugetlbfs_fallocate() is quite a bit more complex, i.e. not something I'm
> keen to duplicate. But that's also a future problem to some extent, as it's
> purely kernel internals; the uAPI side of things doesn't seem like it'll be messy
> at all.
>
> Thanks again!

Sure thing.