Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

From: Linus Torvalds
Date: Sun Jan 28 2024 - 23:36:41 EST


On Sun, 28 Jan 2024 at 18:59, kernel test robot <oliver.sang@xxxxxxxxx> wrote:
>
> 21: 48 8b 47 f8 mov -0x8(%rdi),%rax
> 25: 48 85 c0 test %rax,%rax
> 28: 74 11 je 0x3b
> 2a:* f6 40 78 02 testb $0x2,0x78(%rax) <-- trapping instruction

So this is

struct tracefs_inode *ti = get_tracefs(inode);
struct eventfs_inode *ei = ti->private;

if (!ei || !ei->is_events || ..

in set_top_events_ownership(), and it's that 'ei->is_events' test that oopses.

The 'inode' is the incoming argument (in %rdi), and you don't see code
generation for the "get_tracefs(inode)" because it's just an offset
off the inode.

So the "ti->private" read is that read off "-8(%rdi)", because

struct tracefs_inode {
unsigned long flags;
void *private;
struct inode vfs_inode;
};

so 'private' is 8 bytes below the 'struct inode' pointer.

So 'ei' isn't NULL, but it's a bad pointer, and 'ei->is_events' is
that "testb $0x2,0x78(%rax)" and it oopses as a result.

I don't think this is directly related to that commit 852e46e239ee
("eventfs: Do not create dentries nor inodes in iterate_shared") that
the kernel test robot talks about.

It looks like some inode creation never filled in the ->private field
at all, and it's just garbage.

I note that we have code like this:

create_dir_dentry():
...
mutex_unlock(&eventfs_mutex);

dentry = create_dir(ei, parent);

mutex_lock(&eventfs_mutex);
...
if (!ei->dentry && !ei->is_freed) {
ei->dentry = dentry;
eventfs_post_create_dir(ei);
dentry->d_fsdata = ei;
} else {

and that eventfs_post_create_dir() seems to be where it fills in that
->private pointer:

ti = get_tracefs(ei->dentry->d_inode);
ti->private = ei;

but notice how create_dir() has done that

d_instantiate(dentry, inode);

which exposes the inode to lookup - long before it's actually then filled in.

IOW, what I think is going on is a very basic race, where
create_dir_dentry() will allocate the inode and attach it to the
dentry long before the inode has been fully initialized.

So if somebody comes in *immediately* after that, and does a 'stat()'
on that name that now is exposed, it will see an inode that has not
yet made it to that eventfs_post_create_dir() phase, and that in turn
explains why

struct eventfs_inode *ei = ti->private;

just reads random garbage values.

So if I'm right (big "if" - it looks likely, but who knows) this bug
is entirely unrelated to any dentry caching or any reference counting.

It just needs just the right timing, where one CPU happens to do a
'stat()' just as another one creates the directory.

It's not a big window, so you do need some timing "luck".

End result: the d_instantiate() needs to be done *after* the inode has
been fully filled in.

Alternatively, you could

(a) not drop the eventfs_mutex around the create_dir() call

(b) take the eventfs_mutex around all of set_top_events_ownership()

and just fix it by having the lock protect the lack of ordering.

Linus