Re: [PATCHSET v3 0/5] Add support for epoll min_wait

From: Stefan Hajnoczi
Date: Tue Nov 08 2022 - 15:30:42 EST


On Tue, Nov 08, 2022 at 10:28:37AM -0700, Jens Axboe wrote:
> On 11/8/22 10:24 AM, Stefan Hajnoczi wrote:
> > On Tue, Nov 08, 2022 at 09:15:23AM -0700, Jens Axboe wrote:
> >> On 11/8/22 9:10 AM, Stefan Hajnoczi wrote:
> >>> On Tue, Nov 08, 2022 at 07:09:30AM -0700, Jens Axboe wrote:
> >>>> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
> >>>>> On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
> >>>>>> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
> >>>>>>> Hi Jens,
> >>>>>>> NICs and storage controllers have interrupt mitigation/coalescing
> >>>>>>> mechanisms that are similar.
> >>>>>>
> >>>>>> Yep
> >>>>>>
> >>>>>>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
> >>>>>>> (counter) value. When a completion occurs, the device delays the
> >>>>>>> interrupt until the timeout expires or until the completion counter
> >>>>>>> reaches the threshold.
> >>>>>>>
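(As an aside, for concreteness: both of those knobs are set together via
NVMe's Interrupt Coalescing feature. A minimal sketch of the encoding,
based on my reading of the spec, so double-check the details:

  /* NVMe Interrupt Coalescing (Feature ID 0x08) packs both knobs into
   * one dword:
   *   bits  7:0  Aggregation Threshold (number of completions)
   *   bits 15:8  Aggregation Time (units of 100 usec)
   */
  #include <stdint.h>

  static uint32_t nvme_irq_coalesce_cdw11(uint8_t thr, uint8_t time_100us)
  {
          return (uint32_t)thr | ((uint32_t)time_100us << 8);
  }

The same shape as min_wait: a time bound plus a count bound, whichever
trips first.)
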
> >>>>>>> If I've read the code correctly, min_wait is computed at the beginning
> >>>>>>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
> >>>>>>> completion.
> >>>>>>>
> >>>>>>> It makes me wonder which approach is more useful for applications. With
> >>>>>>> the Aggregation Time approach applications can control how much extra
> >>>>>>> latency is added. What do you think about that approach?
> >>>>>>
> >>>>>> We only tested the current approach, where time is measured from
> >>>>>> syscall entry, not from when the first event arrives. I suspect the
> >>>>>> nvme approach is better suited to the hw side; the epoll timeout
> >>>>>> helps ensure that we batch within xx usec rather than xx usec +
> >>>>>> whatever the delay until the first one arrives, which is why it's
> >>>>>> handled that way currently. That gives you a fixed batch latency.
> >>>>>
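To make sure we mean the same two models, here's how I understand them,
as illustrative user-level C rather than anything resembling the actual
implementations:

  #include <stdint.h>

  uint64_t now_us(void);            /* monotonic clock, microseconds */
  int collect_ready_events(void);   /* gathers events, returns count */

  /* min_wait in this patchset: the clock starts at syscall entry, so
   * every event in the batch sees at most min_wait_us of added delay. */
  int batch_from_entry(uint64_t min_wait_us, int maxevents)
  {
          uint64_t deadline = now_us() + min_wait_us;
          int nr = 0;

          while (now_us() < deadline && nr < maxevents)
                  nr += collect_ready_events();
          return nr;
  }

  /* NVMe-style aggregation: the clock starts at the first completion,
   * so the added delay is bounded relative to that first event. */
  int batch_from_first_event(uint64_t aggr_time_us, int aggr_threshold)
  {
          uint64_t deadline;
          int nr = 0;

          while (nr == 0)                 /* first event starts the clock */
                  nr += collect_ready_events();

          deadline = now_us() + aggr_time_us;
          while (now_us() < deadline && nr < aggr_threshold)
                  nr += collect_ready_events();
          return nr;
  }
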
> >>>>> min_wait is fine when the goal is just maximizing throughput without any
> >>>>> latency targets.
> >>>>
> >>>> That's not true at all; I think you're thinking of different time
> >>>> scales than this would be used for.
> >>>>
> >>>>> The min_wait approach makes it hard to set a useful upper bound on
> >>>>> latency because unlucky requests that complete early experience much
> >>>>> more latency than requests that complete later.
> >>>>
> >>>> As mentioned in the cover letter or the main patch, this is most useful
> >>>> for the medium load kind of scenarios. For high load, the min_wait time
> >>>> ends up not mattering because you will hit maxevents first anyway. For
> >>>> the testing that we did, the target was 200-300 usec, and 200 usec was
> >>>> used for the actual test. Depending on the kind of traffic the server
> >>>> is serving, that's usually not much of a concern. From your
> >>>> reply, I'm guessing you're thinking of much higher min_wait numbers. I
> >>>> don't think those would make sense. If your rate of arrival is low
> >>>> enough that min_wait needs to be high to make a difference, then the
> >>>> load is low enough anyway that it doesn't matter. Hence I'd argue that
> >>>> it is indeed NOT hard to set a useful upper bound on latency, because
> >>>> that is very much what min_wait is.
> >>>>
> >>>> I'm happy to argue merits of one approach over another, but keep in mind
> >>>> that this particular approach was not pulled out of thin air AND it has
> >>>> actually been tested and verified successfully on a production workload.
> >>>> This isn't a hypothetical benchmark kind of setup.
> >>>
> >>> Fair enough. I just wanted to make sure the syscall interface that gets
> >>> merged is as useful as possible.
> >>
> >> That is indeed the main discussion as far as I'm concerned - syscall,
> >> ctl, or both? At this point I'm inclined to just push forward with the
> >> ctl addition. A new syscall can always be added, and if we do, then it'd
> >> be nice to make one that will work going forward so we don't have to
> >> keep adding epoll_wait variants...
> >
> > epoll_wait3() would be consistent with how maxevents and timeout work.
> > It does not suffer from extra ctl syscall overhead when applications
> > need to change min_wait.
> >
> > The way the current patches add min_wait into epoll_ctl() seems hacky to
> > me. struct epoll_event was meant for file descriptor event entries. It
> > won't necessarily be large enough for future extensions (luckily
> > min_wait only needs a uint64_t value). It's turning epoll_ctl() into an
> > ioctl()/setsockopt()-style interface, which is bad for anything that
> > needs to understand syscalls, like seccomp. A properly typed
> > epoll_wait3() seems cleaner to me.
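
To be concrete about "properly typed" (hypothetical, of course -- no
such syscall exists today):

  /* Hypothetical epoll_wait3(): min_wait as a real, visible argument
   * instead of being smuggled through struct epoll_event: */
  int epoll_wait3(int epfd, struct epoll_event *events, int maxevents,
                  int timeout, uint64_t min_wait_usec);
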
>
> The ctl method is definitely a bit of an oddball. I've highlighted why
> I went that way in earlier emails, but in summary:
>
> - Makes it easy to adopt, just adding two lines at init time.
>
> - Moves detection of availability to init time as well, rather than
> the fast path.
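
For reference, the adoption story with the ctl approach as I read the
patchset -- roughly these two lines at init time (the constant name and
fd argument are from memory, so details may be off):

  struct epoll_event ev = { .data.u64 = 200 };   /* min wait, in usec */

  if (epoll_ctl(epfd, EPOLL_CTL_MIN_WAIT, -1, &ev) < 0)
          ;   /* older kernel: no min_wait support, fall back */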

Add an epoll_create1() flag to test for availability?
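
Something like this hypothetical probe (the flag name is made up, but
epoll_create1() already rejects unknown flags with EINVAL):

  int epfd = epoll_create1(EPOLL_CLOEXEC | EPOLL_MIN_WAIT);

  if (epfd < 0 && errno == EINVAL)
          epfd = epoll_create1(EPOLL_CLOEXEC);   /* no min_wait support */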

> I don't think anyone would want to change the wait often; it's
> something you'd set at init time. If you often want to change values
> for some reason, then obviously a syscall parameter would be a lot
> better.
>
> epoll_pwait3() would be vastly different from the other ones, simply
> because epoll_pwait2() is already using the maximum number of args.
> We'd need to add an epoll syscall argument struct at that point,
> probably with flags telling us if the sigmask or timeout is actually
> valid.

Yes :/.
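
Sketching what that might look like, purely hypothetical -- an
extensible argument struct with validity flags, similar in spirit to
clone3()/openat2():

  #define EPOLL_WAIT_TIMEOUT   (1ULL << 0)   /* ts is valid            */
  #define EPOLL_WAIT_SIGMASK   (1ULL << 1)   /* sigmask is valid       */
  #define EPOLL_WAIT_MIN_WAIT  (1ULL << 2)   /* min_wait_usec is valid */

  struct epoll_wait_args {
          __u64 flags;
          __u64 events;          /* ptr: struct epoll_event array   */
          __u32 maxevents;
          __u32 min_wait_usec;
          __u64 ts;              /* ptr: struct __kernel_timespec   */
          __u64 sigmask;         /* ptr: sigset_t                   */
          __u32 sigsetsize;
          __u32 __pad;
  };

  /* int epoll_wait3(int epfd, struct epoll_wait_args *args, size_t size);
   * passing the struct size would leave room to grow it later. */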

> This is not to say I don't think we should add a syscall interface,
> just some of the arguments pro and con from having actually looked
> at it.
>
> --
> Jens Axboe
>
>
