Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

From: Theodore Ts'o
Date: Thu Feb 17 2022 - 21:54:58 EST


On Wed, Feb 16, 2022 at 04:31:36PM +0000, Lee Jones wrote:
>
> After recently receiving a bug report from Syzbot [0] which was raised
> specifically against the Android v5.10 kernel, I spent some time
> trying to get to the crux. Firstly I reproduced the issue on the
> reported kernel, then did the same using the latest release kernel
> v5.16.
>
> The full kernel trace can be seen below at [1].
>

Lee, thanks for your work in trimming down the syzkaller reproducer.
The moral equivalent of what it is doing (except each system call is
done in a separate thread, with synchronization so each gets executed
in order, but perhaps on a different CPU) is:

int dest_fd, src_fd, truncate_fd, mmap_fd;
pid_t tid;
struct iovec local_iov, remote_iov;

dest_fd = open("./bus", O_RDWR|O_CREAT|O_NONBLOCK|O_SYNC|
O_DIRECT|O_LARGEFILE|O_NOATIME, 0);
src_fd = openat(AT_FDCWD, "/bin/bash", O_RDONLY);
truncate_fd = syscall(__NR_open, "./bus", O_RDWR|O_CREAT|O_SYNC|O_NOATIME|O_ASYNC);
ftruncate(truncate_fd, 0x2008002ul);
mmap((void *) 0x20000000ul /* addr */, 0x600000ul /* length */,
PROT_WRITE|PROT_EXEC|PROT_SEM||0x7ffff0, MAP_FIXED|MAP_SHARED, mmap_fd, 0);
sendfile(dest_fd, src_fd, 0 /* offset */, 0x80000005ul /* count */);
local_iov.iov_base = (void *) 0x2034afa4;
local_iov.iov_len = 0x1f80;
remote_iov.iov_base = (void *) 0x20000080;
remote_iov.iov_len = 0x2034afa5;
process_vm_writev(gettid(), &local_iov, 1, &remote_iov, 1, 0);
sendfile(dest_fd, src_fd, 0 /* offset */, 0x1f000005ul /* count */);

Which is then executed repeataedly over and over again. (With the
file descriptors closed at each loop, so the file is reopened each time.)

So basically, we have a scratch file which is initialized using an
sendfile using O_DIRECT. The scratch file is also mmap'ed into the
process's address space, and we then *modify* that an mmap'ed reagion
using process_vm_writev() --- and this is where the problem starts.

process_vm_writev() uses [un]pin_user_pages_remote() which is the same
interface uses for RDMA. But it's not clear this is ever supposed to
work for memory which is mmap'ed region backed by a file.
pin_user_pages_remote() appears to assume that it is an anonymous
region, since the get_user_pages functions in mm/gup.c don't call
read_page() to read data into any pages that might not be mmaped in.

They also don't follow the mm / file system protocol of calling the
file system's write_begin() or page_mkwrite() before modifying a page,
and so when when process_vm_writev() calls unpin_user_pages_remote(),
this tries to dirty the page, but since page_mkwrite() or
write_begin() hasn't been called, buffers haven't been attached to the
page, and so that triggers the following ext4 WARN_ON:

[ 1451.095939] WARNING: CPU: 1 PID: 449 at fs/ext4/inode.c:3565 ext4_set_page_dirty+0x48/0x50
...
[ 1451.139877] Call Trace:
[ 1451.140833] <TASK>
[ 1451.141889] folio_mark_dirty+0x2f/0x60
[ 1451.143408] set_page_dirty_lock+0x3e/0x60
[ 1451.145130] unpin_user_pages_dirty_lock+0x108/0x130
[ 1451.147405] process_vm_rw_single_vec.constprop.0+0x1b9/0x260
[ 1451.150135] process_vm_rw_core.constprop.0+0x156/0x280
[ 1451.159576] process_vm_rw+0xc4/0x110


Then when ext4_writepages() gets called, we trigger the BUG because
buffers haven't been attached to the page. We can certainly avoid the
BUG by checking to see if buffers are attached first, and if not, skip
writing the page trigger a WARN_ON, and then forcibly clear the page
dirty flag so don't permanently leak memory and allow the file system
to be unmounted. (Note: we can't fix the problem by attaching the
buffers in set_page_dirty(), since is occasionally called under
spinlocks and without the page being locked, so we can't do any kind
of allocation, so ix-nay on calling create_empty_buffers() which calls
alloc_buffer_head().)

BUT, that is really papering over the problem, since it's not clear
it's valid to try to use get_user_pages() and friends (including
[un]pin_user_pages_remote() on file-backed memory.

And if it is supposed to be valid, then mm/gup.c needs to first call
readpage() if the page is not present, so that if process_vm_writev()
is only modifying a few bytes in the mmap'ed region, we need to fault
in the page first, and then mm/gup.c needs to inform the file system
to make sure that if pinned memory region is not yet allocated, than
whatever needs to happen to allocate space, via page_mkwrite() has
taken place. (And by the way, that means that pin_user_pages_remote()
may need to return ENOSPC if there is not free space in the file
system, and hence ENOSPC may need to reflected all the way back to
process_vm_writev().

Alternatively, if we don't expect process_vm_writev() to work on
file-backed memory, perhaps it and pin_user_pages_remote() should
return some kind of error?

- Ted