Re: [net-next 0/3] Per epoll context busy poll support

From: Joe Damato
Date: Wed Jan 24 2024 - 09:20:27 EST


On Wed, Jan 24, 2024 at 09:20:09AM +0100, Eric Dumazet wrote:
> On Wed, Jan 24, 2024 at 3:54 AM Joe Damato <jdamato@xxxxxxxxxx> wrote:
> >
> > Greetings:
> >
> > TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
> > epoll with socket fds.") by allowing user applications to enable
> > epoll-based busy polling and set a busy poll packet budget on a per epoll
> > context basis.
> >
> > To allow for this, two ioctls have been added for epoll contexts for
> > getting and setting a new struct, struct epoll_params.
> >
> > This makes epoll-based busy polling much more usable for user
> > applications than the current system-wide sysctl and hardcoded budget.
> >
> > Longer explanation:
> >
> > Presently epoll has support for a very useful form of busy poll based on
> > the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]).
> >
> > This form of busy poll allows epoll_wait to drive NAPI packet processing
> > which allows for a few interesting user application designs which can
> > reduce latency and also potentially improve L2/L3 cache hit rates by
> > deferring NAPI until userland has finished its work.
> >
> > The documentation available on this is, IMHO, a bit confusing so please
> > allow me to explain how one might use this:
> >
> > 1. Ensure each application thread has its own epoll instance mapping
> > 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
> > direct connections with specific dest ports to these queues.
> >
> > 2. Optionally: Setup IRQ coalescing for the NIC RX queues where busy
> > polling will occur. This can help avoid the userland app from being
> > pre-empted by a hard IRQ while userland is running. Note this means that
> > userland must take care to call epoll_wait and not take too long in
> > userland since it now drives NAPI via epoll_wait.
> >
> > 3. Ensure that all incoming connections added to an epoll instance
> > have the same NAPI ID. This can be done with a BPF filter when
> > SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single
> > accept thread is used which dispatches incoming connections to threads.
> >
> > 4. Lastly, busy poll must be enabled via a sysctl
> > (/proc/sys/net/core/busy_poll).
> >
> > The unfortunate part about step 4 above is that this enables busy poll
> > system-wide which affects all user applications on the system,
> > including epoll-based network applications which were not intended to
> > be used this way or applications where increased CPU usage for lower
> > latency network processing is unnecessary or not desirable.
> >
> > If the user wants to run one low latency epoll-based server application
> > with epoll-based busy poll, but would like to run the rest of the
> > applications on the system (which may also use epoll) without busy poll,
> > this system-wide sysctl presents a significant problem.
> >
> > This change preserves the system-wide sysctl, but adds a mechanism (via
> > ioctl) to enable or disable busy poll for epoll contexts as needed by
> > individual applications, making epoll-based busy poll more usable.
> >
>
> I think this description missed the napi_defer_hard_irqs and
> gro_flush_timeout settings ?

I'm not sure if those settings are strictly related to the change I am
proposing which makes epoll-based busy poll something that can be
enabled/disabled on a per-epoll context basis and allows the budget to be
set as well, but maybe I am missing something? Sorry for my
misunderstanding if so.

IMHO: a single system-wide busy poll setting is difficult to use
properly and it is unforunate that the packet budget is hardcoded. It would
be extremely useful to be able to set both of these on a per-epoll basis
and I think my suggested change helps to solve this.

Please let me know.

Re the two settings you noted:

I didn't mention those in the interest of brevity, but yes they can be used
instead of or in addition to what I've described above.

While those settings are very useful, IMHO, they have their own issues
because they are system-wide as well. If they were settable per-NAPI, that
would make it much easier to use them because they could be enabled for the
NAPIs which are being busy-polled by applications that support busy-poll.

Imagine you have 3 types of apps running side-by-side:
- A low latency epoll-based busy poll app,
- An app where latency doesn't matter as much, and
- A latency sensitive legacy app which does not yet support epoll-based
busy poll.

In the first two cases, the settings you mention would be helpful or not
make any difference, but in the third case the system-wide impact might be
undesirable because having IRQs fire might be important to keep latency
down.

If your comment was more that my cover letter should have mentioned these,
I can include that in a future cover letter or suggest some kernel
documentation which will discuss all of these features and how they relate
to each other.

>
> I would think that if an application really wants to make sure its
> thread is the only one
> eventually calling napi->poll(), we must make sure NIC interrupts stay masked.
>
> Current implementations of busy poll always release NAPI_STATE_SCHED bit when
> returning to user space.
>
> It seems you want to make sure the application and only the
> application calls the napi->poll()
> at chosen times.
>
> Some kind of contract is needed, and the presence of the hrtimer
> (currently only driven from dev->@gro_flush_timeout)
> would allow to do that correctly.
>
> Whenever we 'trust' user space to perform the napi->poll shortly, we
> also want to arm the hrtimer to eventually detect
> the application took too long, to restart the other mechanisms (NIC irq based)

There is another change [1] I've been looking at from a research paper [2]
which does something similar to what you've described above -- it keeps
IRQs suppressed during busy polling. The paper suggests a performance
improvement is measured when using a mechanism like this to keep IRQs off.
Please see the paper for more details.

I haven't had a chance to reach out to the authors or to tweak this patch
to attempt an RFC / submission for it, but it seems fairly promising in my
initial synthetic tests.

When I tested their patch, as you might expect, no IRQs were generated at
all for the NAPIs that were being busy polled, but the rest of the
NAPIs and queues were generating IRQs as expected.

Regardless of the above patch: I think my proposed change is helpful and
the IRQ suppression bit can be handled in a separate change in the future.
What do you think?

> Note that we added the kthread based napi polling, and we are working
> to add a busy polling feature to these kthreads.
> allowing to completely mask NIC interrupts and further reduce latencies.

I am aware of kthread based NAPI polling, yes, but I was not aware that
busy polling was being considered as a feature for them, thanks for the
head's up.

> Thank you

Thanks for your comments - I appreciate your time and attention.

Could you let me know if your comments are meant as a n-ack or similar?

I am unsure if you were suggesting that per-epoll context based busy
polling is unneeded/unnecessary from your perspective - or if it was more
of a hint that I should be including more context somewhere in the kernel
documentation as part of this change :)

Again, IMHO, allowing epoll based busy polling to be configured on a
per-epoll context basis (both the usecs and the packet budget) really help
to make epoll-based busy polling much more usable by user apps.

Thanks,
Joe

[1]: https://gitlab.uwaterloo.ca/p5cai/netstack-exp/-/raw/master/kernel-polling-5.15.79-base.patch?ref_type=heads
[2]: https://dl.acm.org/doi/pdf/10.1145/3626780