Re: [PATCHSET v3 0/5] Add support for epoll min_wait

From: Jens Axboe
Date: Tue Nov 08 2022 - 09:09:46 EST


On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
> On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
>> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
>>> Hi Jens,
>>> NICs and storage controllers have interrupt mitigation/coalescing
>>> mechanisms that are similar.
>>
>> Yep
>>
>>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
>>> (counter) value. When a completion occurs, the device waits until the
>>> timeout or until the completion counter value is reached.
>>>
>>> If I've read the code correctly, min_wait is computed at the beginning
>>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
>>> completion.
>>>
>>> It makes me wonder which approach is more useful for applications. With
>>> the Aggregation Time approach applications can control how much extra
>>> latency is added. What do you think about that approach?
>>
>> We only tested the current approach, where time is measured from entry,
>> not from when the first event arrives. I suspect the nvme approach is
>> better suited to the hw side; the epoll timeout helps ensure that we
>> batch within xx usec rather than xx usec + whatever the delay is until
>> the first one arrives. Which is why it's handled that way currently.
>> That gives you a fixed batch latency.
>
> min_wait is fine when the goal is just maximizing throughput without any
> latency targets.

That's not true at all; I think you're thinking of different time scales
than the ones this would be used for.

> The min_wait approach makes it hard to set a useful upper bound on
> latency because unlucky requests that complete early experience much
> more latency than requests that complete later.

As mentioned in the cover letter or the main patch, this is most useful
for medium-load kinds of scenarios. For high load, the min_wait time
ends up not mattering, because you will hit maxevents first anyway. For
the testing that we did, the target was 200-300 usec, and 200 usec was
used for the actual test. Depending on what kind of traffic the
server is serving, that's usually not much of a concern. From your
reply, I'm guessing you're thinking of much higher min_wait numbers. I
don't think those would make sense. If your rate of arrival is low
enough that min_wait needs to be high to make a difference, then the
load is low enough anyway that it doesn't matter. Hence I'd argue that
it is indeed NOT hard to set a useful upper bound on latency, because
that is very much what min_wait is.
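To put rough numbers on that: with min_wait measured from epoll_wait
entry, the added latency is bounded by min_wait itself, whereas an
NVMe-style window that only starts at the first completion adds the
arrival delay on top. Toy illustration below, numbers made up purely to
show the two semantics:

#include <stdio.h>

/*
 * Toy comparison of the two batching semantics, assuming a 200 usec
 * window and a first event arriving t_first usec after entering the
 * wait. Purely illustrative.
 */
int main(void)
{
	const unsigned int min_wait = 200;		/* usec, from entry */
	const unsigned int t_first[] = { 0, 50, 150 };	/* first event arrival */

	for (unsigned int i = 0; i < sizeof(t_first) / sizeof(t_first[0]); i++) {
		/* entry-relative: wakeup at a fixed offset from entry */
		unsigned int entry_rel = min_wait;
		/* first-event-relative: window only starts at the first event */
		unsigned int event_rel = t_first[i] + min_wait;

		printf("first event at %3u usec: entry-relative %3u, event-relative %3u\n",
		       t_first[i], entry_rel, event_rel);
	}
	return 0;
}

Whatever the arrival pattern, the entry-relative wakeup never exceeds
min_wait, which is the upper bound argument above.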

I'm happy to argue the merits of one approach over another, but keep in
mind that this particular approach was not pulled out of thin air AND it
has actually been tested and verified successfully on a production
workload. This isn't a hypothetical benchmark kind of setup.
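
For reference, the application side is a one-time setup call. Rough
sketch below, using the 200 usec value from above; the opcode value and
passing the time in usec through the event data field are assumptions
here, see the actual patches for the real interface:

#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>

#ifndef EPOLL_CTL_MIN_WAIT
#define EPOLL_CTL_MIN_WAIT	4	/* assumed value, illustrative only */
#endif

int main(void)
{
	struct epoll_event ev = { 0 };
	int epfd = epoll_create1(0);

	if (epfd < 0) {
		perror("epoll_create1");
		return EXIT_FAILURE;
	}

	/* assumed convention: min wait in usec via the data field, fd == -1 */
	ev.data.u64 = 200;
	if (epoll_ctl(epfd, EPOLL_CTL_MIN_WAIT, -1, &ev) < 0)
		perror("epoll_ctl(EPOLL_CTL_MIN_WAIT)");

	/*
	 * From here on, epoll_wait() waits up to 200 usec from entry to
	 * batch events before returning, unless maxevents is hit first.
	 */
	return 0;
}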

--
Jens Axboe