Re: [PATCH v2] vfs: shave work on failed file open

From: Jann Horn
Date: Thu Sep 28 2023 - 10:44:22 EST

Next message: Dmitry Vyukov: "Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq"
Previous message: Borislav Petkov: "Re: [RFC PATCH v2 0/9] Use ERST for persistent storage of MCE and APEI errors"
In reply to: Christian Brauner: "Re: [PATCH v2] vfs: shave work on failed file open"
Next in thread: Linus Torvalds: "Re: [PATCH v2] vfs: shave work on failed file open"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, Sep 28, 2023 at 4:05 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
>
> > So I spent a good chunk of time going through this patch.
>
> The main thing that makes me go "we shouldn't do this" is that KASAN
> isn't able to detect UAF issues as Jann pointed out so I'm getting
> really nervous about this.

(FWIW there is an in-progress patch to address this that I sent a few
weeks ago but that is not landed yet,
<https://lore.kernel.org/linux-mm/20230825211426.3798691-1-jannh@xxxxxxxxxx/>.
So currently KASAN can only detect UAF in SLAB_TYPESAFE_BY_RCU slabs
when the slab allocator has given them back to the page allocator.)

> And Jann also pointed out some potential issues with
> __fget_files_rcu() as well...

The issue I see with the current __fget_files_rcu() is that the
"file->f_mode & mask" is no longer effective in its current position,
it would have to be moved down below the get_file_rcu() call.
That's a semantic difference between manually RCU-freeing and
SLAB_TYPESAFE_BY_RCU - we no longer have the guarantee that an object
can't be freed and reallocated within a single RCU grace period.
With the current patch, we could race like this:

```
static inline struct file *__fget_files_rcu(struct files_struct *files,
unsigned int fd, fmode_t mask)
{
for (;;) {
struct file *file;
struct fdtable *fdt = rcu_dereference_raw(files->fdt);
struct file __rcu **fdentry;

if (unlikely(fd >= fdt->max_fds))
return NULL;

fdentry = fdt->fd + array_index_nospec(fd, fdt->max_fds);
file = rcu_dereference_raw(*fdentry);
if (unlikely(!file))
return NULL;

if (unlikely(file->f_mode & mask))
return NULL;

[in another thread:]
[file is removed from fd table and freed]
[file is reallocated as something like an O_PATH file,
which the check above would not permit]
[reallocated file is inserted in the fd table in the
same position]

/*
* Ok, we have a file pointer. However, because we do
* this all locklessly under RCU, we may be racing with
* that file being closed.
*
* Such a race can take two forms:
*
* (a) the file ref already went down to zero,
* and get_file_rcu() fails. Just try again:
*/
if (unlikely(!get_file_rcu(file))) [succeeds]
continue;

/*
* (b) the file table entry has changed under us.
* Note that we don't need to re-check the 'fdt->fd'
* pointer having changed, because it always goes
* hand-in-hand with 'fdt'.
*
* If so, we need to put our ref and try again.
*/
[recheck succeeds because the new file was inserted in
the same position]
if (unlikely(rcu_dereference_raw(files->fdt) != fdt) ||
unlikely(rcu_dereference_raw(*fdentry) != file)) {
fput(file);
continue;
}

/*
* Ok, we have a ref to the file, and checked that it
* still exists.
*/
[a file incompatible with the supplied mask is returned]
return file;
}
}
```

There are also some weird get_file_rcu() users in other places like
BPF's task_file_seq_get_next and in gfs2_glockfd_next_file that do
weird stuff without the recheck, especially gfs2_glockfd_next_file
even looks at the inodes of files without taking a reference (which
seems a little dodgy but maybe actually currently works because inodes
are also RCU-freed?). So I think you'd have to clean all of that up
before you can make this change.

Similar thing with get_mm_exe_file(), that relies on get_file_rcu()
success meaning that the file was not reallocated. And tid_fd_mode()
in procfs assumes that task_lookup_fd_rcu() returns a file* whose mode
can be inspected under RCU.

As Linus already mentioned, release_empty_file() is also broken now,
because it assumes that nobody will grab references to unopened files,
but actually that can now happen spuriously when a concurrent fget()
has called get_file_rcu() on a recycled file and not yet hit the
recheck fput(). Kinda like the thing with "struct page" where GUP can
randomly spuriously bump up the refcount of any page including ones
that are not mapped into userspace. So that would have to go through
the same fput() path as every other file freeing.

We also now rely on the "f_count" initialization in init_file()
happening after the point of no return, which is currently the case,
but that'd have to be documented to avoid someone adding a later
bailout in the future, and maybe could be clarified by actually moving
the count initialization after the bailout?

Heh, I grepped for `__rcu.*file`, and BPF has a thing in
kernel/bpf/verifier.c that seems to imply it would be safe for some
types of BPF programs to follow the mm->exe_file reference solely
protected by RCU, which already seems a little dodgy now but more
after this change:

```
/* RCU trusted: these fields are trusted in RCU CS and can be NULL */
BTF_TYPE_SAFE_RCU_OR_NULL(struct mm_struct) {
struct file __rcu *exe_file;
};
```

(To be clear: This is not intended to be an exhaustive list.)

So I think conceptually this is something you can do but it would
require a bit of cleanup all around the kernel to make sure you really
just have one or two central functions that make use of the limited
RCU-ness of "struct file", and that nothing else relies on that or
makes assumptions about how non-zero refcounts move.

Next message: Dmitry Vyukov: "Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq"
Previous message: Borislav Petkov: "Re: [RFC PATCH v2 0/9] Use ERST for persistent storage of MCE and APEI errors"
In reply to: Christian Brauner: "Re: [PATCH v2] vfs: shave work on failed file open"
Next in thread: Linus Torvalds: "Re: [PATCH v2] vfs: shave work on failed file open"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]