Re: [External] Re: [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd

From: Pavel Begunkov
Date: Thu Feb 03 2022 - 18:26:52 EST


On 2/3/22 22:16, Jens Axboe wrote:
On 2/3/22 2:47 PM, Pavel Begunkov wrote:
On 2/3/22 19:54, Usama Arif wrote:
On 03/02/2022 19:06, Jens Axboe wrote:
On 2/3/22 12:00 PM, Pavel Begunkov wrote:
On 2/3/22 18:29, Jens Axboe wrote:
On 2/3/22 11:26 AM, Usama Arif wrote:
Hmm, maybe I didn't understand you and Pavel correctly. Are you
suggesting to do the below diff over patch 3? I don't think that would be
correct, as it is possible that just after checking if ctx->io_ev_fd is
present, unregister can be called by another thread and set ctx->io_ev_fd
to NULL, which would cause a NULL pointer dereference later? In the current
patch, the check of whether ev_fd exists happens as the first thing
after rcu_read_lock, and RCU read locks are extremely cheap, I believe.

They are cheap, but they are still noticeable at high requests/sec
rates. So would be best to avoid them.

And yes it's obviously racy, there's the potential to miss an eventfd
notification if it races with registering an eventfd descriptor. But
that's not really a concern, as if you register with inflight IO
pending, then that always exists just depending on timing. The only
thing I care about here is that it's always _safe_. Hence something ala
what you did below is totally fine, as we're re-evaluating under rcu
protection.

Indeed, the patch doesn't have any formal guarantees for propagation
to already inflight requests, so this extra unsynchronised check
doesn't change anything.

I'm still more curious why we need RCU and extra complexity when
apparently there is no use case for that. If it's only about
initial initialisation, then as I described there is a much
simpler approach.

Would be nice if we could get rid of the quiesce code in general, but I
haven't done a check to see what'd be missing after this...


I had checked! I had posted the below in reply to v1 (https://lore.kernel.org/io-uring/02fb0bc3-fc38-b8f0-3067-edd2a525ef29@xxxxxxxxx/T/#m5ac7867ac61d86fe62c099be793ffe5a9a334976), but I think it got missed! Copy-pasting here for reference:

May have missed it then, apologies

"
I see that if we remove ring quiesce from the above 3 opcodes, then
only IORING_REGISTER_ENABLE_RINGS and IORING_REGISTER_RESTRICTIONS are
left for ring quiesce. I just had a quick look at those, and from what I
see we might not need to enter ring quiesce in
IORING_REGISTER_ENABLE_RINGS as the ring is already disabled at that point?
And for IORING_REGISTER_RESTRICTIONS if we do a similar approach to
IORING_REGISTER_EVENTFD, i.e. wrap ctx->restrictions inside an RCU
protected data structure, use spin_lock to prevent multiple
io_register_restrictions calls at the same time, and use rcu_read_lock
in io_check_restriction, then we can remove ring quiesce from
io_uring_register altogether?

My use case only uses IORING_REGISTER_EVENTFD, but I think entering ring
quiesce costs about the same for the other opcodes. If the above sounds
reasonable, please let me know and I can send patches for removing ring
quiesce for io_uring_register.
"

Let me know if the above makes sense; I can add patches on top of the current patchset, or we can do it after they get merged.

As for why, quiesce state is very expensive. It's making io_uring_register the most expensive syscall in my use case (~15ms) compared to ~0.1ms now with RCU, which is why I started investigating this. And this patchset avoids ring quiesce for 3 of the opcodes, so it would generally be quite helpful if someone registers and unregisters eventfd multiple times.

I agree that 15ms for initial setup is silly and it has to be
reduced. However, I'm trying to weigh the extra complexity against
potential benefits of _also_ optimising [de,re]-registration

Considering that you only register it one time at the beginning,
we risk adding yet another feature that nobody is ever going to
use. This doesn't give me a nice feeling, well, unless you do
have a use case.

It's not really a new feature, it's just making the existing one not
suck quite as much...

Does it matter when nobody uses it? My point is that it does not.


To emphasise, I'm comparing 15->0.1 improvement for only initial
registration (which is simpler) vs 15->0.1 for both registration
and unregistration.

reg+unreg should be way faster too, if done properly with the assignment
tricks.

fwiw, it alters userspace visible behaviour in either case, shouldn't
be as important here but there is always a chance to break userspace

It doesn't alter userspace behavior, if the registration works like I
described with being able to assign a new one while the old one is being
torn down.

Or do you mean wrt inflight IO? I don't think the risk is very high
there, to be honest.

Right, if somebody tries such a trick it'll be pretty confusing to
get randomly firing eventfd, though it's rather a marginal case.

--
Pavel Begunkov