Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested epoll that in ET mode

From: Jason Baron
Date: Mon Oct 07 2019 - 12:43:27 EST




On 10/7/19 6:54 AM, Roman Penyaev wrote:
> On 2019-10-03 18:13, Jason Baron wrote:
>> On 9/30/19 7:55 AM, Roman Penyaev wrote:
>>> On 2019-09-28 04:29, Andrew Morton wrote:
>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@xxxxxx> wrote:
>>>>
>>>>> From: Heiher <r@xxxxxx>
>>>>>
>>>>> Take the case where we have:
>>>>>
>>>>> ÂÂÂÂÂÂÂ t0
>>>>> ÂÂÂÂÂÂÂÂ | (ew)
>>>>> ÂÂÂÂÂÂÂ e0
>>>>> ÂÂÂÂÂÂÂÂ | (et)
>>>>> ÂÂÂÂÂÂÂ e1
>>>>> ÂÂÂÂÂÂÂÂ | (lt)
>>>>> ÂÂÂÂÂÂÂ s0
>>>>>
>>>>> t0: thread 0
>>>>> e0: epoll fd 0
>>>>> e1: epoll fd 1
>>>>> s0: socket fd 0
>>>>> ew: epoll_wait
>>>>> et: edge-trigger
>>>>> lt: level-trigger
>>>>>
>>>>> We only need to wakeup nested epoll fds if something has been queued
>>>>> to the
>>>>> overflow list, since the ep_poll() traverses the rdllist during
>>>>> recursive poll
>>>>> and thus events on the overflow list may not be visible yet.
>>>>>
>>>>> Test code:
>>>>
>>>> Look sane to me. Do you have any performance testing results which
>>>> show a benefit?
>>>>
>>>> epoll maintainership isn't exactly a hive of activity nowadays :(
>>>> Roman, would you please have time to review this?
>>>
>>> So here is my observation: current patch does not fix the described
>>> problem (double wakeup) for the case, when new event comes exactly
>>> to the ->ovflist. According to the patch this is the desired intention:
>>>
>>> ÂÂ /*
>>> ÂÂÂ * We only need to wakeup nested epoll fds if something has been
>>> queued
>>> ÂÂÂ * to the overflow list, since the ep_poll() traverses the rdllist
>>> ÂÂÂ * during recursive poll and thus events on the overflow list may
>>> not be
>>> ÂÂÂ * visible yet.
>>> ÂÂÂ */
>>> ÂÂÂ if (nepi != NULL)
>>> ÂÂÂÂÂÂ pwake++;
>>>
>>> ÂÂÂ ....
>>>
>>> ÂÂÂ if (pwake == 2)
>>> ÂÂÂÂÂÂ ep_poll_safewake(&ep->poll_wait);
>>>
>>>
>>> but this actually means that we repeat the same behavior (double wakeup)
>>> but only for the case, when event comes to the ->ovflist.
>>>
>>> How to reproduce? Can be easily done (ok, not so easy but it is possible
>>> to try): to the given userspace test we need to add one more socket and
>>> immediately fire the event:
>>>
>>> ÂÂÂ e.events = EPOLLIN;
>>> ÂÂÂ if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
>>> ÂÂÂÂÂÂ goto out;
>>>
>>> ÂÂÂ /*
>>> ÂÂÂÂ * Signal any fd to let epoll_wait() to call ep_scan_ready_list()
>>> ÂÂÂÂ * in order to "catch" it there and add new event to ->ovflist.
>>> ÂÂÂÂ */
>>> ÂÂÂÂ if (write(s2fd[1], "w", 1) != 1)
>>> ÂÂÂÂÂÂÂ goto out;
>>>
>>> That is done in order to let the following epoll_wait() call to invoke
>>> ep_scan_ready_list(), where we can "catch" and insert new event exactly
>>> to the ->ovflist. In order to insert event exactly in the correct list
>>> I introduce artificial delay.
>>>
>>> Modified test and kernel patch is below. Here is the output of the
>>> testing tool with some debug lines from kernel:
>>>
>>> Â # ~/devel/test/edge-bug
>>> Â [ÂÂ 59.263178] ### sleep 2
>>> Â >> write to sock
>>> Â [ÂÂ 61.318243] ### done sleep
>>> Â [ÂÂ 61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait);
>>> events_in_rdllist=1, events_in_ovflist=1
>>> Â [ÂÂ 61.321204] ### sleep 2
>>> Â [ÂÂ 63.398325] ### done sleep
>>> Â error: What?! Again?!
>>>
>>> First epoll_wait() call (ep_scan_ready_list()) observes 2 events
>>> (see "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
>>> wanted to achieve, so eventually ep_poll_safewake() is called again
>>> which leads to double wakeup.
>>>
>>> In my opinion current patch as it is should be dropped, it does not
>>> fix the described problem but just hides it.
>>>
>>> --
>
> Hi Jason,
>
>>
>> Yes, there are 2 wakeups in the test case you describe, but if the
>> second event (write to s1fd) gets queued after the first call to
>> epoll_wait(), we are going to get 2 wakeups anyways.
>
> Yes, exactly, for this reason I print out the number of events observed
> on first wait, there should be 1 (rdllist) and 1 (ovflist), otherwise
> this is another case, when second event comes exactly after first
> wait, which is legitimate wakeup.
>
>> So yes, there may
>> be a slightly bigger window with this patch for 2 wakeups, but its small
>> and I tried to be conservative with the patch - I'd rather get an
>> occasional 2nd wakeup then miss one. Trying to debug missing wakeups
>> isn't always fun...
>>
>> That said, the reason for propagating events that end up on the overflow
>> list was to prevent the race of the wakee not seeing events because they
>> were still on the overflow list. In the testcase, imagine if there was a
>> thread doing epoll_wait() on efd[0], and then a write happends on s1fd.
>> I thought it was possible then that a 2nd thread doing epoll_wait() on
>> efd[1], wakes up, checks efd[0] and sees no events because they are
>> still potentially on the overflow list. However, I think that case is
>> not possible because the thread doing epoll_wait() on efd[0] is going to
>> have the ep->mtx, and thus when the thread wakes up on efd[1], its going
>> to have to be ordered because its also grabbing the ep->mtx associated
>> with efd[0].
>>
>> So I think its safe to do the following if we want to go further than
>> the proposed patch, which is what you suggested earlier in the thread
>> (minus keeping the wakeup on ep->wq).
>
> Then I do not understand why we need to keep ep->wq wakeup?
> @wq and @poll_wait are almost the same with only one difference:
> @wq is used when you sleep on it inside epoll_wait() and the other
> is used when you nest epoll fd inside epoll fd. Either you wake
> both up either you don't this at all.
>
> ep_poll_callback() does wakeup explicitly, ep_insert() and ep_modify()
> do wakeup explicitly, so what are the cases when we need to do wakeups
> from ep_scan_ready_list()?

Hi Roman,

So the reason I was saying not to drop the ep->wq wakeup was that I was
thinking about a usecase where you have multi-threads say thread A and
thread B which are doing epoll_wait() on the same epfd. Now, the threads
both call epoll_wait() and are added as exclusive to ep->wq. Now a bunch
of events happen and thread A is woken up. However, thread A may only
process a subset of the events due to its 'maxevents' parameter. In that
case, I was thinking that the wakeup on ep->wq might be helpful, because
in the absence of subsequent events, thread B can now start processing
the rest, instead of waiting for the next event to be queued.

However, I was thinking about the state of things before:
86c0517 fs/epoll: deal with wait_queue only once

Before that patch, thread A would have been removed from eq->wq before
the wakeup call, thus waking up thread B. However, now that thread A
stays on the queue during the call to ep_send_events(), I believe the
wakeup would only target thread A, which doesn't help since its already
checking for events. So given the state of things I think you are right
in that its not needed. However, I wonder if not removing from the
ep->wq affects the multi-threaded case I described. Its been around
since 5.0, so probably not, but it would be a more subtle performance
difference.

Thanks,

-Jason




>
> I would still remove the whole branch:
>
>
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -671,7 +671,6 @@ static __poll_t ep_scan_ready_list(struct eventpoll
> *ep,
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ void *priv, int depth, bool ep_locked)
> Â{
> ÂÂÂÂÂÂÂ __poll_t res;
> -ÂÂÂÂÂÂ int pwake = 0;
> ÂÂÂÂÂÂÂ struct epitem *epi, *nepi;
> ÂÂÂÂÂÂÂ LIST_HEAD(txlist);
>
> @@ -738,26 +737,11 @@ static __poll_t ep_scan_ready_list(struct
> eventpoll *ep,
> ÂÂÂÂÂÂÂÂ */
> ÂÂÂÂÂÂÂ list_splice(&txlist, &ep->rdllist);
> ÂÂÂÂÂÂÂ __pm_relax(ep->ws);
> -
> -ÂÂÂÂÂÂ if (!list_empty(&ep->rdllist)) {
> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ /*
> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * Wake up (if active) both the eventpoll wait list and
> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * the ->poll() wait list (delayed after we release the
> lock).
> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ */
> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ if (waitqueue_active(&ep->wq))
> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ wake_up(&ep->wq);
> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ if (waitqueue_active(&ep->poll_wait))
> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ pwake++;
> -ÂÂÂÂÂÂ }
> ÂÂÂÂÂÂÂ write_unlock_irq(&ep->lock);
>
> ÂÂÂÂÂÂÂ if (!ep_locked)
> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ mutex_unlock(&ep->mtx);
>
> -ÂÂÂÂÂÂ /* We have to call this outside the lock */
> -ÂÂÂÂÂÂ if (pwake)
> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ep_poll_safewake(&ep->poll_wait);
> -
> ÂÂÂÂÂÂÂ return res;
> Â}
>
> --
> Roman
>
>
>