Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts

From: Steven Rostedt
Date: Wed Jan 31 2024 - 07:57:49 EST


On Tue, 30 Jan 2024 22:20:24 -0800
Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Tue, 30 Jan 2024 at 21:57, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Ugh.
>
> Oh, and double-ugh on that tracefs_syscall_mkdir/rmdir(). I hate how
> it does that "unlock and re-lock inode" thing.

I'd figured you'd like that one.

>
> It's a disease, and no, it's not an acceptable response to "lockdep
> shows there's a locking problem".
>
> The comment says "It is up to the individual mkdir routine to handle
> races" but that's *completely* bogus and shows a fundamental
> misunderstanding of locking.
>
> Dropping the lock in the middle of a locked section does NOT affect
> just the section that you dropped the lock around.
>
> It affects the *caller*, who took the lock in the first place, and who
> may well assume that the lock protects things for the whole duration
> of the lock.
>
> And so dropping the lock in the middle of an operation is a bad BAD
> *BAD* thing to do.
>
> Honestly, I had only been looking at the eventfs_inode.c lookup code.
> But now that I started looking at mkdir/rmdir, I'm seeing the same
> signs of horrible mistakes there too.
>
> And yes, it might be a real problem. For example, for the rmdir case,
> the actual lock was taken by 'vfs_rmdir()', and it does *not* protect
> only the ->rmdir() call itself.
>
> It also, for example, is supposed to make the ->rmdir be atomic wrt things like
>
> dentry->d_inode->i_flags |= S_DEAD;
>
> and
>
> dont_mount(dentry);
> detach_mounts(dentry);
>
> but now because the inode lock was dropped in the middle, for all we
> know a "mkdir()" could come in on the same name, and make a mockery of
> the whole inode locking.
>
> The inode lock on that directory that is getting removed is also what
> is supposed to make it impossible to add new files to the directory
> while the rmdir is going on.
>
> Again, dropping the lock violates those kinds of guarantees.
>
> "git blame" actually fingers Al for that "unlock and relock", but
> that's because Al did a search-and-replace on
> "mutex_[un]lock(&inode->i_mutex)" and replaced it with
> "inode_[un]lock(inode)" back in 2016.
>
> The real culprit is you, and that sh*t-show goes back to commit
> 277ba04461c2 ("tracing: Add interface to allow multiple trace
> buffers").
>
> Christ. Now I guess I need to start looking if there is anything else
> horrid lurking in that mkdir/rmdir path.
>
> I doubt the inode locking problem ends up mattering in this situation.
> Mostly since this is only tracefs, and normal users shouldn't be able
> to access these things anyway. And I think (hope?) that you only
> accept mkdir/rmdir in specific places.

Yes, this was very deliberate. It was for a very "special" feature, and
thus very limited.

>
> But this really is another case of "This code smells *fundamentally* wrong".
>
> Adding Al, in the hopes that he will take a look at this too.

This is something I asked Al about when I wrote it. This isn't a normal
rmdir. I remember telling Al what it was doing. Basically, it's just
a way to tell the tracing infrastructure to create a new instance; it
doesn't actually create a normal directory. It's similar to the
kprobe_events interface, where writing to a file creates directories
and files. Ideally I wanted a mkdir interface as it felt more natural,
and I was ready to fall back to another "echo foo > make_instance"
interface if this didn't work. But after talking with Al, I felt that it
could work. The main issue is that this mkdir doesn't just make a
directory, it creates the entire tree (like what is done in
/sys/fs/cgroup). If this were more like kernfs instead of debugfs, I may
not have had this problem. That was another time I tried to understand
how kernfs worked.
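For context, this is roughly what the interface looks like from
userspace (a sketch, assuming root, tracefs mounted at
/sys/kernel/tracing, and a made-up instance name "foo"):

```shell
# Assumes tracefs is mounted at /sys/kernel/tracing (older kernels may
# use /sys/kernel/debug/tracing). "foo" is a made-up instance name.
cd /sys/kernel/tracing/instances

mkdir foo       # one mkdir creates the whole instance tree, not an empty dir
ls foo          # events/, trace, trace_pipe, set_event, ... appear at once

rmdir foo       # destroys the instance; fails with EBUSY while it's in use
```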

This is the reason all the opens in the tracing code have:

	struct trace_array *tr = inode->i_private;
	int ret;

	ret = tracing_check_open_get_tr(tr);
	if (ret)
		return ret;

In kernel/trace/trace.c:

LIST_HEAD(ftrace_trace_arrays);

int trace_array_get(struct trace_array *this_tr)
{
	struct trace_array *tr;
	int ret = -ENODEV;

	mutex_lock(&trace_types_lock);
	list_for_each_entry(tr, &ftrace_trace_arrays, list) {
		if (tr == this_tr) {
			tr->ref++;
			ret = 0;
			break;
		}
	}
	mutex_unlock(&trace_types_lock);

	return ret;
}

static void __trace_array_put(struct trace_array *this_tr)
{
	WARN_ON(!this_tr->ref);
	this_tr->ref--;
}

void trace_array_put(struct trace_array *this_tr)
{
	if (!this_tr)
		return;

	mutex_lock(&trace_types_lock);
	__trace_array_put(this_tr);
	mutex_unlock(&trace_types_lock);
}

int tracing_check_open_get_tr(struct trace_array *tr)
{
	int ret;

	ret = security_locked_down(LOCKDOWN_TRACEFS);
	if (ret)
		return ret;

	if (tracing_disabled)
		return -ENODEV;

	if (tr && trace_array_get(tr) < 0)
		return -ENODEV;

	return 0;
}

The rmdir code that tracefs calls is also in kernel/trace/trace.c:

static int instance_rmdir(const char *name)
{
	struct trace_array *tr;
	int ret;

	mutex_lock(&event_mutex);
	mutex_lock(&trace_types_lock);

	ret = -ENODEV;
	tr = trace_array_find(name);
	if (tr)
		ret = __remove_instance(tr);

	mutex_unlock(&trace_types_lock);
	mutex_unlock(&event_mutex);

	return ret;
}

The mkdir creates a new trace_array (tr) and the rmdir destroys it. If
there's any reference on a trace array, the rmdir fails. On every open,
a check is made to see if the trace array that was attached to the inode
still exists, and if it does, its ref count is bumped to prevent an
rmdir from happening.

It's a very limited mkdir and rmdir. I remember Al asking me questions
about what it can and can't do, and me asking him to make sure the
lock release wasn't going to cause problems.

I wrote several fuzzers for this code that perform mkdir and rmdir on
the instances while other tasks are trying to access them (reading and
writing). In the beginning, they found a few bugs. But I was eventually
able to get the fuzzers to run non-stop for over a month, and that's
when I felt that this was fine.

I was always wary of this code because of the locking situation. The
kernel selftests have a stress test on this code too:

tools/testing/selftests/ftrace/test.d/instances/instance-event.tc
tools/testing/selftests/ftrace/test.d/instances/instance.tc

These are run as part of the selftest suite, which is run by most
people testing the kernel, including KernelCI.

I haven't had a problem with this code in years, and unless we find a
real bug that needs to be fixed, I'm afraid to touch it.

-- Steve