Re: [PATCH] eventfd: support delayed wakeup for non-semaphore eventfd to reduce cpu utilization

From: Jens Axboe
Date: Wed Apr 19 2023 - 12:42:25 EST


On 4/19/23 3:12?AM, Christian Brauner wrote:
> On Tue, Apr 18, 2023 at 08:15:03PM -0600, Jens Axboe wrote:
>> On 4/17/23 10:32?AM, Wen Yang wrote:
>>>
>>> ? 2023/4/17 22:38, Jens Axboe ??:
>>>> On 4/16/23 5:31?AM, wenyang.linux@xxxxxxxxxxx wrote:
>>>>> From: Wen Yang <wenyang.linux@xxxxxxxxxxx>
>>>>>
>>>>> For the NON SEMAPHORE eventfd, if it's counter has a nonzero value,
>>>>> then a read(2) returns 8 bytes containing that value, and the counter's
>>>>> value is reset to zero. Therefore, in the NON SEMAPHORE scenario,
>>>>> N event_writes vs ONE event_read is possible.
>>>>>
>>>>> However, the current implementation wakes up the read thread immediately
>>>>> in eventfd_write so that the cpu utilization increases unnecessarily.
>>>>>
>>>>> By adding a configurable delay after eventfd_write, these unnecessary
>>>>> wakeup operations are avoided, thereby reducing cpu utilization.
>>>> What's the real world use case of this, and what would the expected
>>>> delay be there? With using a delayed work item for this, there's
>>>> certainly a pretty wide grey zone in terms of delay where this would
>>>> perform considerably worse than not doing any delayed wakeups at all.
>>>
>>>
>>> Thanks for your comments.
>>>
>>> We have found that the CPU usage of the message middleware is high in
>>> our environment, because sensor messages from MCU are very frequent
>>> and constantly reported, possibly several hundred thousand times per
>>> second. As a result, the message receiving thread is frequently
>>> awakened to process short messages.
>>>
>>> The following is the simplified test code:
>>> https://github.com/w-simon/tests/blob/master/src/test.c
>>>
>>> And the test code in this patch is further simplified.
>>>
>>> Finally, only a configuration item has been added here, allowing users
>>> to make more choices.
>>
>> I think you'd have a higher chance of getting this in if the delay
>> setting was per eventfd context, rather than a global thing.
>
> That patch seems really weird. Is that an established paradigm to
> address problems like this through a configured wakeup delay? Because
> naively this looks like a pretty brutal hack.

It is odd, and it is a brutal hack. My worries were outlined in an
earlier reply, there's quite a big gap where no delay would be better
and the delay approach would be miserable because it'd cause extra
latency and extra context switches. It'd be much cleaner if you KNEW
there'd be more events coming, as you could then get rid of that delayed
work item completely. And I suspect, if this patch makes sense, that
it'd be better to have a number+time limit as well and if you hit the
event number count that you'd notify inline and put some smarts in the
delayed work handling to just not do anything if nothing is pending.

--
Jens Axboe