Re: [RFC net-next v3 23/29] io_uring: allow to pass addr into sendzc

From: Stefan Metzmacher
Date: Mon Aug 15 2022 - 09:31:38 EST


Hi Pavel,

Thanks for giving the API some thought; are you trying
to use it in samba?

Yes, but I'd need SENDMSGZC first, and then I'd like to test which
variant gives the best performance. It also depends on the configured
samba vfs module stack.

I can send you a branch this week if you would be willing to try it
out, as I'll be sending only the "msg" variant for 5.21.

I'm not sure I'll have time to do runtime testing,
but it would be great to have a look at the code and give some comments
based on that.

I think it should be:

                   if (up->arg)
                           slot->tag = up->arg;
                   if (!slot->notif)
                           continue;
                   io_notif_slot_flush_submit(slot, issue_flags);

or even:

                   slot->tag = up->arg;
                   if (!slot->notif)
                           continue;
                   io_notif_slot_flush_submit(slot, issue_flags);

otherwise IORING_RSRC_UPDATE_NOTIF would not be able to reset the tag
if the notif was never created or was already flushed.

Ah, you want to update it for later. The idea was to affect only
those notifiers that are flushed by this update.
...

notif->cqe.user_data = slot->tag; happens in io_alloc_notif(),
so the slot->tag = up->arg; here is always for the next IO_SENDZC.
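
To spell out that ordering with a toy model (made-up types, nothing
here is the actual kernel code): user_data is sampled from the slot at
allocation time, so a later tag update only shows up in the next
notification.

/* toy model only, not kernel code */
#include <assert.h>

struct toy_notif { unsigned long long cqe_user_data; };
struct toy_slot  { unsigned long long tag; };

/* models io_alloc_notif(): samples slot->tag at allocation time */
static void toy_alloc_notif(struct toy_notif *notif, const struct toy_slot *slot)
{
        notif->cqe_user_data = slot->tag;
}

int main(void)
{
        struct toy_slot slot = { .tag = 1 };
        struct toy_notif first, second;

        toy_alloc_notif(&first, &slot);   /* first SENDZC allocates its notif  */
        slot.tag = 2;                     /* IORING_RSRC_UPDATE_NOTIF sets tag */
        toy_alloc_notif(&second, &slot);  /* next SENDZC allocates a new notif */

        assert(first.cqe_user_data == 1); /* unaffected by the update          */
        assert(second.cqe_user_data == 2);/* update applies to the next one    */
        return 0;
}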

With IORING_RSRC_UPDATE_NOTIF linked to an IORING_OP_SENDZC (with
IORING_RECVSEND_NOTIF_FLUSH), I basically try to reset slot->tag to the
same (or a related) user_data as the SENDZC itself.

So that each SENDZC generates two CQEs with the same user_data belonging
to the same userspace buffer.


I had a similar chat with Dylan last week. I'd rather not rob the SQE
of an additional u64, as there is only addr3 left and then we're fully
packed, but there is another option we were thinking about, based on
the OVERRIDE_TAG feature I scrapped from the final version of the
zerocopy patches.

Long story short, the idea is to copy req->cqe.user_data of a
send(+flush) request into the notification CQE, so you'll get 2 CQEs
with identical user_data but they can be distinguished by looking at
cqe->flags.

What do you think? Would it work for you?

I guess that would work.
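
For illustration, a minimal userspace sketch of consuming such a CQE
pair; io_uring_prep_send_zc(), IORING_CQE_F_MORE and IORING_CQE_F_NOTIF
below are assumed names for whatever the final API provides, nothing
here is defined by this series:

/* sketch only, assumed helper and flag names */
#include <arpa/inet.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
        static const char buf[] = "hello"; /* must stay valid until the notif CQE */
        struct sockaddr_in dst = { .sin_family = AF_INET,
                                   .sin_port = htons(9) };
        struct io_uring_cqe *cqe;
        struct io_uring_sqe *sqe;
        struct io_uring ring;
        int fd, more;

        inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);
        fd = socket(AF_INET, SOCK_DGRAM, 0);
        connect(fd, (struct sockaddr *)&dst, sizeof(dst));
        io_uring_queue_init(8, &ring, 0);

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_send_zc(sqe, fd, buf, strlen(buf), 0, 0);
        io_uring_sqe_set_data64(sqe, 0x1234);      /* user_data for BOTH CQEs */
        io_uring_submit(&ring);

        /* First CQE: the send result. F_MORE says a second CQE with the
         * same user_data will follow. */
        io_uring_wait_cqe(&ring, &cqe);
        more = cqe->flags & IORING_CQE_F_MORE;
        printf("tag %llu: send res %d\n",
               (unsigned long long)cqe->user_data, cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        /* Second CQE: same user_data, flagged as the notification, i.e.
         * the net layer has dropped its references and buf is reusable. */
        if (more) {
                io_uring_wait_cqe(&ring, &cqe);
                printf("tag %llu: notif (flags 0x%x)\n",
                       (unsigned long long)cqe->user_data, cqe->flags);
                io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return 0;
}

That way both completions for one buffer carry the same identifier, and
only cqe->flags tells the send result from the reuse notification.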

I'm also wondering what will happen if a notif is still referenced by
the net layer but the io_uring instance is already closed: wouldn't
io_uring_tx_zerocopy_callback() or __io_notif_complete_tw() crash
because notif->ctx is a stale pointer, or because notif itself is
already gone...

io_uring will flush all slots and wait for all notifications
to fire, i.e. io_uring_tx_zerocopy_callback(), so it's not a
problem.

I can't follow :-(

What I see in io_notif_unregister() is this:

                 nd = io_notif_to_data(notif);
                 slot->notif = NULL;
                 if (!refcount_dec_and_test(&nd->uarg.refcnt))
                         continue;

So if the net layer still has a reference we just go on.

Only a wild guess, but is it something like this:

io_alloc_notif():
         ...
         notif->task = current;
         io_get_task_refs(1);
         notif->rsrc_node = NULL;
         io_req_set_rsrc_node(notif, ctx, 0);
         ...

and

__io_req_complete_put():
                 ...
                 io_req_put_rsrc(req);
                 /*
                  * Selected buffer deallocation in io_clean_op() assumes that
                  * we don't hold ->completion_lock. Clean them here to avoid
                  * deadlocks.
                  */
                 io_put_kbuf_comp(req);
                 io_dismantle_req(req);
                 io_put_task(req->task, 1);
                 ...

that causes io_ring_exit_work() to wait for it.

It would be great if you or someone else could explain this in detail
and maybe add some comments to the code.

Almost, the mechanism is absolutely the same as with requests,
and notifiers are actually requests for internal purposes.

In __io_alloc_req_refill() we grab ctx->refs, which are waited
for in io_ring_exit_work(). We usually put requests into a cache,
so when a request is complete we don't put the ref and therefore
in io_ring_exit_work() we also have a call to io_req_caches_free(),
which puts ctx->refs.
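
A toy userspace model of that rule, just to spell it out (made-up
names, nothing here is the actual kernel code): every request,
including a notif, pins the ctx with a reference taken at allocation;
completing into the request cache keeps that reference; exit work
drains the cache and then has to wait until the count drops to zero,
which can't happen while the net layer still holds a notification.

/* toy model only, not kernel code */
#include <assert.h>

struct toy_ctx {
        int refs;        /* models ctx->refs                        */
        int cached;      /* completed requests parked in the cache  */
};

/* models __io_alloc_req_refill(): allocation grabs a ctx ref */
static void toy_alloc_req(struct toy_ctx *ctx)        { ctx->refs++; }

/* completing a request parks it in the cache; the ctx ref is kept */
static void toy_complete_cached(struct toy_ctx *ctx)  { ctx->cached++; }

/* models io_req_caches_free(): frees cached reqs and puts their refs */
static void toy_caches_free(struct toy_ctx *ctx)
{
        ctx->refs -= ctx->cached;
        ctx->cached = 0;
}

/* models the wait in io_ring_exit_work() */
static int toy_exit_would_block(const struct toy_ctx *ctx)
{
        return ctx->refs != 0;
}

int main(void)
{
        struct toy_ctx ctx = { 0, 0 };

        toy_alloc_req(&ctx);           /* a SENDZC request                 */
        toy_alloc_req(&ctx);           /* its notification (also a req)    */
        toy_complete_cached(&ctx);     /* the SENDZC completes             */

        toy_caches_free(&ctx);         /* exit work drains the cache...    */
        assert(toy_exit_would_block(&ctx)); /* ...but the notif still pins ctx */

        toy_complete_cached(&ctx);     /* net drops its ref, notif fires   */
        toy_caches_free(&ctx);         /* next round of the exit work loop */
        assert(!toy_exit_would_block(&ctx));
        return 0;
}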

Ok, thanks.

Would a close() on the ring fd block? I guess not, but the exit_work may block, correct?
So the process would stay a zombie until the net layer has released all references?

metze