Re: [PATCH v6] fuse: Add support for passthrough read/write

From: Jann Horn
Date: Thu Aug 13 2020 - 14:30:52 EST


On Thu, Aug 13, 2020 at 3:28 PM Alessio Balsini <balsini@xxxxxxxxxxx> wrote:
> On Wed, Aug 12, 2020 at 08:29:58PM +0200, 'Jann Horn' via kernel-team wrote:
[...]
> > On Wed, Aug 12, 2020 at 6:15 PM Alessio Balsini <balsini@xxxxxxxxxxx> wrote:
> > > Add support for filesystem passthrough read/write of files when enabled in
> > > userspace through the option FUSE_PASSTHROUGH.
[...]
> > Unfortunately, this step isn't really compatible with how the current
> > FUSE API works. Essentially, the current FUSE API is reached via
> > operations like the write() syscall, and reaches FUSE through either
> > ->write_iter or ->splice_write in fuse_dev_operations. In that
> > context, operations like fget_raw() that operate on the calling
> > process are unsafe.
> >
> > The reason why operations like fget() are unsafe in this context is
> > that the security model of Linux fundamentally assumes that if you get
> > a file descriptor from an untrusted process, and you write stuff into
> > it, anything that happens will be limited to things that the process
> > that gave you the file descriptor would've been able to do on its own.
> > Or in other words, an attacker shouldn't be able to gain anything by
> > convincing a privileged process to write attacker-controlled data into
> > an attacker-supplied file descriptor. With the current design, an
> > attacker may be able to trick a privileged process into installing one
> > of its FDs as a passthrough FD into an attacker-controlled FUSE
> > instance (while the privileged process thinks that it's just writing
> > some opaque data into some file), and thereby make it possible for an
> > attacker to indirectly gain the ability to read/write that FD.
> >
>
> This is a great explanation.
>
> I've been thinking about this before posting the patch and my final thought
> was that being the FUSE daemon already responsible for handling file ops
> coming from untrusted processes, and the privileges of these ops are anyway
> escalated to the daemon's, if the FUSE daemon has vulnerabilities to
> exploit,

This is not about vulnerabilities in the FUSE daemon.

> there's not much we can do to avoid an attacker to mess with files
> at the same permission level as the FUSE daemon. And this is true also
> without this patch, right?
> IOW, the feature introduced here is something that I agree should be
> handled with care, but is there something that can go wrong if the FUSE
> daemon is written properly? If we cannot trust the FUSE daemon, then we
> should also not trust what it would be able to access, so isn't it enough
> to prove that an attacker wouldn't be able to get more privileges than the
> FUSE daemon? Sorry if I missed something.

My point is that you can use this security issue to steal file
descriptors from processes that are _not_ supposed to act as FUSE
daemons. For example, let's say there is some setuid root executable
that opens /etc/shadow and then writes a message to stdout. If I
invoke this setuid root executable with stdout pointing to a FUSE
device fd, and I can arrange for the message written by the setuid
process to look like a FUSE message, I can trick the setuid process to
install its /etc/shadow fd as a passthrough fd on a FUSE file, and
then, through the FUSE filesystem, mess with /etc/shadow.

Also, for context: While on Android, access to /dev/fuse is
restricted, desktop Linux systems allow unprivileged users to use
FUSE.

> > The only way I see to fix this somewhat cleanly would be to add a new
> > command to fuse_dev_ioctl() that can be used to submit replies as an
> > alternative to the write()-based interface. (That should probably be a
> > separate patch in this patch series.) Then, you could have a flag
> > argument to fuse_dev_do_write() that tells it whether the ioctl
> > interface was used, and use that information to decide whether
> > fuse_setup_passthrough() is allowed.
> > (An alternative approach would be to require userspace to set some
> > flag on the write operation that says "I am intentionally performing
> > an operation that depends on caller identity" and pass that through -
> > e.g. by using pwritev2()'s flags argument. But I think historically
> > the stance has been that stuff like write() simply should not be
> > looking at the calling process.)
> >
>
> I'm not sure I got it right. Could you please elaborate on what is the
> purpose of the new fuse_dev_ioctl() command?
> Do you mean letting the kernel decide whether a FUSE daemon is allowed to
> run fuse_setup_passthrough() or to decide if passthrough should be allowed
> on a specific file?

The new fuse_dev_ioctl() command would behave just like
fuse_dev_write(), except that the ioctl-based interface would permit
OPEN/CREATE replies with FOPEN_PASSTHROUGH, while the write()-based
interface would reject them.

> In general, I prefer to avoid as much as I can API changes, let's see if
> there is a way to leave them untouched. :)
> What do you think about adding some extra checkings in
> fuse_setup_passthrough() to let the kernel decide if passthrough is doable?
> Do you think this would make things better regarding what you mentioned?

I can't think of any checks that you could do there to make it safe,
because fundamentally what you have to figure out is _userspace's
intent_, and you can't figure out what that is if userspace just calls
write().

[...]
> > > + if (!passthrough_filp)
> > > + return;
> >
> > This error path can only be reached if the caller supplied invalid
> > data. IMO this should bail out with an error.
>
> As you can see I switched from the BUG_ON() approach of the previous patch
> to the extreme opposite of transparent, graceful error handling.
> Do you think we should abort the whole open operation, or adding a few
> warning messages may suffice?

In this case, an error indicates that the userspace programmer made a
mistake. So even if the userspace programmer is not looking at kernel
logs, we should indicate to them that they messed up - and we can do
that by returning an error code from the syscall. So I think we should
ideally abort the operation in this case.

[...]
> > > + passthrough_inode = file_inode(passthrough_filp);
> > > +
> > > + iocb->ki_filp = passthrough_filp;
> >
> > Hmm... so we're temporarily switching out the iocb's ->ki_filp here? I
> > wonder whether it is possible for some other code to look at ->ki_filp
> > concurrently... maybe Jens Axboe knows whether that could plausibly
> > happen?
> >
> > Second question about this switcheroo below...
> >
> > > + if (write) {
> > > + if (!passthrough_filp->f_op->write_iter)
> > > + goto out;
> > > +
> > > + ret = call_write_iter(passthrough_filp, iocb, iter);
> >
> > Hmm, I don't think we can just invoke
> > call_write_iter()/call_read_iter() like this. If you look at something
> > like vfs_writev(), you can see that normally, there are a bunch of
> > other things that happen:
> >
> > - file_start_write() before the write
> > - check whether the file's ->f_mode permits writing with FMODE_WRITE
> > and FMODE_CAN_WRITE
> > - rw_verify_area() for stuff like mandatory locking and LSM security
> > checks (although admittedly this LSM security check is pretty useless)
> > - fsnotify_modify() to trigger inotify watches and such that notify
> > userspace of file modifications
> > - file_end_write() after the write
> >
> > You should probably try to use vfs_iocb_iter_write() here, and figure
> > out how to properly add file_start_write()/file_end_write() calls
> > around this. Perhaps ovl_write_iter() from fs/overlayfs/file.c can
> > help with this? Note that you can't always just call file_end_write()
> > synchronously, since the request may complete asynchronously.
>
> Answering here both the two previous questions, that are strictly related.
> I couldn't find any racy path for a specific iocp->ki_filp. Glad to be
> proven wrong, let's see if Jens can bring some light here.
>
> Jumping back to vfs with call_{read,write}_iter(), this looked to me as the
> most elegant solution, and I find it handy that they also perform extra
> checks on the target file.

No, call_write_iter() doesn't do any of the required checking and
setup and notification steps, it just calls into the file's
->write_iter handler:

static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
struct iov_iter *iter)
{
return file->f_op->write_iter(kio, iter);
}

Did you confuse call_write_iter() with vfs_iocb_iter_write(), or
something like that?

> I was worried at the beginning about this vfs
> nesting weirdness, but the nesting unrolls just fine, and with visual
> inspection I couldn't spot any dangerous situations.
> Are you just worried about write operations or reads as well? Maybe Jens
> can give some extra advice here too.
>
> > > +out:
> > > + iocb->ki_filp = fuse_filp;
> >
> > Also a question that I hope Jens can help with: If this is an
> > asynchronous request, would something bad happen if the request
> > completes before we reach that last ->ki_filp write (if that is even
> > possible)? Or could an asynchronous request blow up because this
> > ->ki_filp write happens before the request has completed?
>
> I cannot think of a scenario where we don't complete the request before
> restoring the original ki_filp. Jens again to prove me wrong.

Requests can complete asynchronously. That means call_write_iter() can
return more or less immediately, while the request is processed in the
background, and at some point, a callback is invoked. That's what that
-EIOCBQUEUED return value is about. In that case, the current code
will change ->ki_filp while the request is still being handled in the
background.

I recommend looking at ovl_write_iter() in overlayfs to see how
they're dealing with this case.