Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts

From: Steven Rostedt
Date: Wed Jan 31 2024 - 13:39:32 EST


On Wed, 31 Jan 2024 10:08:37 -0800
Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Wed, 31 Jan 2024 at 05:14, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> >
> > If you also notice, tracefs only allows mkdir/rmdir to be assigned to
> > one directory. Once it is assigned, no other directories can have mkdir
> > rmdir functionality.
>
> I think that does limit the damage, but it's not clear that it is actually safe.
>
> Because you don't hold the inode lock, somebody could come in and do a
> mkdir inside the other one that is being removed, ie something like
>
> - thread 1 took the inode lock, called ->rmdir
>
> - it then drops the inode lock (both parent and the directory that is
> getting removed) and gets the event lock
>
> - but thread 2 can come in between that inode lock drop and event lock
>
> Notice: thread 2 comes in where there is *no* locking. Nada. Zilch.
>
> This is what worries me.

Yep, and that was my biggest concern too, which is why I have stress tests
that try to hit the above scenario: rmdir racing with other accesses,
including other mkdirs of the same name.

As my knowledge of inode lifetimes is still limited, my main concern was
just corruption of the dentries/inodes themselves. But the first one to
get the event_mutex determines the state of the file system.

If thread 1 is doing rmdir, what would thread 2 do that can harm it?

The rmdir path calls trace_remove(), which essentially retries removing the
directory, and hopefully has the proper locking, just like removing a
kprobe trace event that deletes directories. The directory can still have
references on it.
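
To make that concrete, here is a rough sketch (not the actual tracefs
code; the function name is made up) of the removal pattern I'm describing:
take the event_mutex and trace_types_lock, refuse to tear down a
trace_array that still has references, and only then unlink it so no new
opens can find it:

/*
 * Illustrative sketch only, not the real kernel code.  It assumes the
 * usual tracing globals (event_mutex, trace_types_lock) and a
 * trace_array with a ->ref count and a ->list entry.
 */
static int my_remove_instance(struct trace_array *tr)
{
        int ret = -EBUSY;

        mutex_lock(&event_mutex);
        mutex_lock(&trace_types_lock);

        /* An open file still holds a reference; the rmdir must fail. */
        if (tr->ref)
                goto out;

        /* Unlink it so the open-side lookup can no longer find it. */
        list_del(&tr->list);
        ret = 0;
out:
        mutex_unlock(&trace_types_lock);
        mutex_unlock(&event_mutex);
        return ret;
}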

Now if something were to get a reference and a valid dentry, then as soon
as the open function is called, the tracing logic will see that the
trace_array no longer exists and return an error.

All the open functions for files that are created in an instance (and that
includes eventfs files) have a check to see if the inode->i_private data is
still valid. The trace_array structure represents the directory, and
there's a linked list of all the trace_array structures that is protected
by the trace_types_lock. The check grabs that lock and iterates the list to
see if the passed-in trace_array is on it. If it is, it bumps the ref count
(preventing a rmdir from succeeding) and returns it. If it is not, it
returns NULL and the open call fails as if it had opened a nonexistent file.
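
As a rough sketch of that open-side check (again, not the exact kernel
code; the helper names here are made up, though trace_types_lock and the
ftrace_trace_arrays list are the ones I'm referring to):

static struct trace_array *my_trace_array_find_get(struct trace_array *want)
{
        struct trace_array *tr, *found = NULL;

        mutex_lock(&trace_types_lock);
        list_for_each_entry(tr, &ftrace_trace_arrays, list) {
                if (tr == want) {
                        tr->ref++;      /* blocks a concurrent rmdir */
                        found = tr;
                        break;
                }
        }
        mutex_unlock(&trace_types_lock);

        return found;   /* NULL means the instance is already gone */
}

static int my_open(struct inode *inode, struct file *filp)
{
        struct trace_array *tr = inode->i_private;

        /* i_private may point at a trace_array that was already removed. */
        if (!my_trace_array_find_get(tr))
                return -ENODEV;

        filp->private_data = tr;
        return 0;
}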

>
> But it does help that it's all only in *one* directory. At least
> another mkdir cannot happen *inside* the one that is going away while
> the locks are not held. So the good news is that it does mean that
> there's only one point that is being protected.
>
> But I do worry about things like this (in vfs_rmdir()):
>
>         inode_lock(dentry->d_inode);
>
>         error = -EBUSY;
>         if (is_local_mountpoint(dentry) ||
>             (dentry->d_inode->i_flags & S_KERNEL_FILE))
>                 goto out;
>
>         error = security_inode_rmdir(dir, dentry);
>         if (error)
>                 goto out;
>
>         error = dir->i_op->rmdir(dir, dentry);
>         if (error)
>                 goto out;
>
> notice how it does that "is this a mount point" test atomically wrt
> the rmdir before it is allowed to proceed.

You mean if someone did:

# mkdir instances/foo
# rmdir instances/foo

and at the same time, someone else did

# mount -t ext4 /dev/sda instances/foo

?

OK, I never thought of that use case. Although I think if someone is
trying to mount anything inside tracefs, they can keep the pieces. ;-)

>
> And I do think that the inode lock is what also protects it from
> concurrent mounts. So now what happens when that "thread 2" above
> comes in while there is *no* locking, and mounts something there?
>
> Now, I'm not saying this is a huge problem. But it's very much an
> example of a thing that *could* be a problem. Dropping locks in the
> middle is just very scary.

No arguments from me. I really didn't like the dropping of the locks, and
tried hard to avoid it. If switching over to kernfs can solve that, I'd let
that conversion happen.

I'm all for someone switching tracefs over to kernfs if it solves all these
unknown bugs, as long as it doesn't hurt the memory savings of eventfs. But
again, I'm also willing to make eventfs its own file system (although I
don't have the time to do that yet) so that tracefs isn't burdened by it.

-- Steve