Re: io_uring: risky use of task work, especially wrt fdget()

From: Jens Axboe
Date: Tue Nov 28 2023 - 11:28:27 EST


On 11/28/23 9:19 AM, Jens Axboe wrote:
> On 11/28/23 8:58 AM, Jens Axboe wrote:
>> On 11/27/23 2:53 PM, Jann Horn wrote:
>>> Hi!
>>>
>>> I noticed something that I think does not currently cause any
>>> significant security issues, but could be problematic in the future:
>>>
>>> io_uring sometimes processes task work in the middle of syscalls,
>>> including between fdget() and fdput(). My understanding of task work
>>> is that it is expected to run in a context similar to directly at
>>> syscall entry/exit: task context, no locks held, sleeping is okay, and
>>> it doesn't execute in the middle of some syscall that expects private
>>> state of the task_struct to stay the same.
>>>
>>> An example of another user of task work is the keyring subsystem,
>>> which does task_work_add() in keyctl_session_to_parent() to change the
>>> cred pointers of another task.
>>>
>>> Several places in io_uring process task work while holding an fdget()
>>> reference to some file descriptor. For example, the io_uring_enter
>>> syscall handler calls io_iopoll_check() while the io_ring_ctx is only
>>> referenced via fdget(). This means that if there were another kernel
>>> subsystem that uses task work to close file descriptors, io_uring
>>> would become unsafe. And io_uring does _almost_ that itself, I think:
>>> io_queue_worker_create() can be run on a workqueue, and uses task work
>>> to launch a worker thread from the context of a userspace thread; and
>>> this worker thread can then accept commands to close file descriptors.
>>> Except it doesn't accept commands to close io_uring file descriptors.
>>>
>>> A closer miss might be io_sync_cancel(), which holds a reference to
>>> some normal file with fdget()/fdput() while calling into
>>> io_run_task_work_sig(). However, from what I can tell, the only things
>>> that are actually done with this file pointer are pointer comparisons,
>>> so this also shouldn't have significant security impact.
>>>
>>> Would it make sense to use fget()/fput() instead of fdget()/fdput() in
>>> io_sync_cancel(), io_uring_enter and io_uring_register? These
>>> functions probably usually run in multithreaded environments anyway
>>> (thanks to the io_uring worker threads), so I would think fdget()
>>> shouldn't bring significant performance savings here?
>>
>> Let me run some testing on that. It's a mistake to think that it's
>> usually multithreaded, generally if you end up using io-wq then it's not
>> a fast path. A fast networked setup, for example, would never touch the
>> threads and hence no threading would be implied by using io_uring. Ditto
>> on the storage front, if you're just reading/writing or eg doing polled
>> IO. That said, those workloads are generally threaded _anyway_ - not
>> because of io_uring, but because that's how these kinds of workloads are
>> written to begin with.
>>
>> So probably won't be much of a concern to do the swap. The only
>> "interesting" part of the above mix of cancel/register/enter is
>> obviously the enter part. The rest are not really fast path.
>
> Did all three and ran the usual testing, which just so happens to be
> multithreaded to begin with anyway. No discernable change from using
> fget/fput over fdget/fdput.
>
> IOW, we may as well do this. Do you want to send a patch? Or I can send
> out mine, up to you.

For reference, this is what I ran my testing with:

diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index 3c19cccb1aec..8a8b07dfc444 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -273,7 +273,7 @@ int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg)
};
ktime_t timeout = KTIME_MAX;
struct io_uring_sync_cancel_reg sc;
- struct fd f = { };
+ struct file *file = NULL;
DEFINE_WAIT(wait);
int ret, i;

@@ -295,10 +295,10 @@ int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg)
/* we can grab a normal file descriptor upfront */
if ((cd.flags & IORING_ASYNC_CANCEL_FD) &&
!(cd.flags & IORING_ASYNC_CANCEL_FD_FIXED)) {
- f = fdget(sc.fd);
- if (!f.file)
+ file = fget(sc.fd);
+ if (!file)
return -EBADF;
- cd.file = f.file;
+ cd.file = file;
}

ret = __io_sync_cancel(current->io_uring, &cd, sc.fd);
@@ -348,6 +348,7 @@ int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg)
if (ret == -ENOENT || ret > 0)
ret = 0;
out:
- fdput(f);
+ if (file)
+ fput(file);
return ret;
}
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 05f933dddfde..aba5657d287e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3652,7 +3652,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
size_t, argsz)
{
struct io_ring_ctx *ctx;
- struct fd f;
+ struct file *file;
long ret;

if (unlikely(flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP |
@@ -3670,20 +3670,19 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
if (unlikely(!tctx || fd >= IO_RINGFD_REG_MAX))
return -EINVAL;
fd = array_index_nospec(fd, IO_RINGFD_REG_MAX);
- f.file = tctx->registered_rings[fd];
- f.flags = 0;
- if (unlikely(!f.file))
+ file = tctx->registered_rings[fd];
+ if (unlikely(!file))
return -EBADF;
} else {
- f = fdget(fd);
- if (unlikely(!f.file))
+ file = fget(fd);
+ if (unlikely(!file))
return -EBADF;
ret = -EOPNOTSUPP;
- if (unlikely(!io_is_uring_fops(f.file)))
+ if (unlikely(!io_is_uring_fops(file)))
goto out;
}

- ctx = f.file->private_data;
+ ctx = file->private_data;
ret = -EBADFD;
if (unlikely(ctx->flags & IORING_SETUP_R_DISABLED))
goto out;
@@ -3777,7 +3776,8 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
}
}
out:
- fdput(f);
+ if (!(flags & IORING_ENTER_REGISTERED_RING))
+ fput(file);
return ret;
}

@@ -4618,7 +4618,7 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
{
struct io_ring_ctx *ctx;
long ret = -EBADF;
- struct fd f;
+ struct file *file;
bool use_registered_ring;

use_registered_ring = !!(opcode & IORING_REGISTER_USE_REGISTERED_RING);
@@ -4637,27 +4637,27 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
if (unlikely(!tctx || fd >= IO_RINGFD_REG_MAX))
return -EINVAL;
fd = array_index_nospec(fd, IO_RINGFD_REG_MAX);
- f.file = tctx->registered_rings[fd];
- f.flags = 0;
- if (unlikely(!f.file))
+ file = tctx->registered_rings[fd];
+ if (unlikely(!file))
return -EBADF;
} else {
- f = fdget(fd);
- if (unlikely(!f.file))
+ file = fget(fd);
+ if (unlikely(!file))
return -EBADF;
ret = -EOPNOTSUPP;
- if (!io_is_uring_fops(f.file))
+ if (!io_is_uring_fops(file))
goto out_fput;
}

- ctx = f.file->private_data;
+ ctx = file->private_data;

mutex_lock(&ctx->uring_lock);
ret = __io_uring_register(ctx, opcode, arg, nr_args);
mutex_unlock(&ctx->uring_lock);
trace_io_uring_register(ctx, opcode, ctx->nr_user_files, ctx->nr_user_bufs, ret);
out_fput:
- fdput(f);
+ if (!use_registered_ring)
+ fput(file);
return ret;
}


--
Jens Axboe