Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested epoll that in ET mode

From: Jason Baron
Date: Mon Oct 07 2019 - 14:44:37 EST




On 10/7/19 2:30 PM, Roman Penyaev wrote:
> On 2019-10-07 18:42, Jason Baron wrote:
>> On 10/7/19 6:54 AM, Roman Penyaev wrote:
>>> On 2019-10-03 18:13, Jason Baron wrote:
>>>> On 9/30/19 7:55 AM, Roman Penyaev wrote:
>>>>> On 2019-09-28 04:29, Andrew Morton wrote:
>>>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@xxxxxx> wrote:
>>>>>>
>>>>>>> From: Heiher <r@xxxxxx>
>>>>>>>
>>>>>>> Take the case where we have:
>>>>>>>
>>>>>>>         t0
>>>>>>>          | (ew)
>>>>>>>         e0
>>>>>>>          | (et)
>>>>>>>         e1
>>>>>>>          | (lt)
>>>>>>>         s0
>>>>>>>
>>>>>>> t0: thread 0
>>>>>>> e0: epoll fd 0
>>>>>>> e1: epoll fd 1
>>>>>>> s0: socket fd 0
>>>>>>> ew: epoll_wait
>>>>>>> et: edge-trigger
>>>>>>> lt: level-trigger
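>>>>>>>
>>>>>>> For concreteness, the topology can be built roughly like this
>>>>>>> (illustrative sketch only, not the actual test code; fd names and
>>>>>>> error handling are simplified):
>>>>>>>
>>>>>>> #include <sys/epoll.h>
>>>>>>> #include <sys/socket.h>
>>>>>>>
>>>>>>> static void setup(void)
>>>>>>> {
>>>>>>>         int e0, e1, sv[2];
>>>>>>>         struct epoll_event e;
>>>>>>>
>>>>>>>         socketpair(AF_UNIX, SOCK_STREAM, 0, sv);  /* s0 = sv[0] */
>>>>>>>         e0 = epoll_create1(0);
>>>>>>>         e1 = epoll_create1(0);
>>>>>>>
>>>>>>>         e.events = EPOLLIN;           /* (lt) s0 level-triggered in e1 */
>>>>>>>         e.data.fd = sv[0];
>>>>>>>         epoll_ctl(e1, EPOLL_CTL_ADD, sv[0], &e);
>>>>>>>
>>>>>>>         e.events = EPOLLIN | EPOLLET; /* (et) e1 edge-triggered in e0 */
>>>>>>>         e.data.fd = e1;
>>>>>>>         epoll_ctl(e0, EPOLL_CTL_ADD, e1, &e);
>>>>>>>
>>>>>>>         /* (ew) t0 then blocks in epoll_wait() on e0 */
>>>>>>> }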
>>>>>>>
>>>>>>> We only need to wakeup nested epoll fds if something has been queued
>>>>>>> to the overflow list, since the ep_poll() traverses the rdllist during
>>>>>>> recursive poll and thus events on the overflow list may not be
>>>>>>> visible yet.
>>>>>>>
>>>>>>> Test code:
>>>>>>
>>>>>> Looks sane to me. Do you have any performance testing results which
>>>>>> show a benefit?
>>>>>>
>>>>>> epoll maintainership isn't exactly a hive of activity nowadays :(
>>>>>> Roman, would you please have time to review this?
>>>>>
>>>>> So here is my observation: the current patch does not fix the
>>>>> described problem (double wakeup) for the case when the new event
>>>>> comes exactly to ->ovflist. According to the patch this is the
>>>>> intended behavior:
>>>>>
>>>>>    /*
>>>>>     * We only need to wakeup nested epoll fds if something has been
>>>>>     * queued to the overflow list, since the ep_poll() traverses the
>>>>>     * rdllist during recursive poll and thus events on the overflow
>>>>>     * list may not be visible yet.
>>>>>     */
>>>>>    if (nepi != NULL)
>>>>>        pwake++;
>>>>>
>>>>>    ....
>>>>>
>>>>>    if (pwake == 2)
>>>>>        ep_poll_safewake(&ep->poll_wait);
>>>>>
>>>>>
>>>>> but this actually means that we repeat the same behavior (double
>>>>> wakeup), only now for the case when the event comes to ->ovflist.
>>>>>
>>>>> How to reproduce? It can be done (ok, not so easily, but it is
>>>>> possible to try): to the given userspace test we need to add one
>>>>> more socket and immediately fire the event:
>>>>>
>>>>>    e.events = EPOLLIN;
>>>>>    if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
>>>>>        goto out;
>>>>>
>>>>>    /*
>>>>>     * Signal any fd to let epoll_wait() call ep_scan_ready_list()
>>>>>     * in order to "catch" it there and add the new event to ->ovflist.
>>>>>     */
>>>>>    if (write(s2fd[1], "w", 1) != 1)
>>>>>        goto out;
>>>>>
>>>>> That is done in order to let the following epoll_wait() call invoke
>>>>> ep_scan_ready_list(), where we can "catch" it and insert the new
>>>>> event exactly into ->ovflist. To make sure the event lands on the
>>>>> correct list I introduce an artificial delay.
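>>>>>
>>>>> Conceptually the delay sits in ep_scan_ready_list(), after ->rdllist
>>>>> has been spliced to the tx list and ->ovflist has been activated, so
>>>>> that the concurrent ep_poll_callback() queues the new event to
>>>>> ->ovflist (sketch only, needs linux/delay.h; the printks match the
>>>>> output below):
>>>>>
>>>>>    /* ->rdllist is detached, ->ovflist is now active */
>>>>>    printk("### sleep 2\n");
>>>>>    msleep(2000);   /* artificial delay; debug hack, not for merging */
>>>>>    printk("### done sleep\n");
>>>>>
>>>>>    res = (*sproc)(ep, &txlist, priv);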
>>>>>
>>>>> The modified test and kernel patch are below. Here is the output of
>>>>> the testing tool with some debug lines from the kernel:
>>>>>
>>>>>  # ~/devel/test/edge-bug
>>>>>  [   59.263178] ### sleep 2
>>>>>  >> write to sock
>>>>>  [   61.318243] ### done sleep
>>>>>  [   61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait); events_in_rdllist=1, events_in_ovflist=1
>>>>>  [   61.321204] ### sleep 2
>>>>>  [   63.398325] ### done sleep
>>>>>  error: What?! Again?!
>>>>>
>>>>> The first epoll_wait() call (ep_scan_ready_list()) observes 2 events
>>>>> (see the "!!!!!!!!!!! ep_poll_safewake" output line), which is exactly
>>>>> what we wanted to achieve, so eventually ep_poll_safewake() is called
>>>>> again, which leads to the double wakeup.
>>>>>
>>>>> In my opinion the current patch as it is should be dropped: it does
>>>>> not fix the described problem but just hides it.
>>>>>
>>>>> --
>>>
>>> Hi Jason,
>>>
>>>>
>>>> Yes, there are 2 wakeups in the test case you describe, but if the
>>>> second event (write to s1fd) gets queued after the first call to
>>>> epoll_wait(), we are going to get 2 wakeups anyway.
>>>
>>> Yes, exactly. For this reason I print out the number of events observed
>>> on the first wait: there should be 1 (rdllist) and 1 (ovflist); otherwise
>>> this is the other case, when the second event comes just after the first
>>> wait, which is a legitimate wakeup.
>>>
>>>> So yes, there may
>>>> be a slightly bigger window with this patch for 2 wakeups, but it's
>>>> small and I tried to be conservative with the patch - I'd rather get
>>>> an occasional 2nd wakeup than miss one. Trying to debug missing
>>>> wakeups isn't always fun...
>>>>
>>>> That said, the reason for propagating events that end up on the
>>>> overflow list was to prevent the race of the wakee not seeing events
>>>> because they were still on the overflow list. In the testcase, imagine
>>>> if there was a thread doing epoll_wait() on efd[0], and then a write
>>>> happens on s1fd. I thought it was possible then that a 2nd thread
>>>> doing epoll_wait() on efd[1] wakes up, checks efd[0] and sees no
>>>> events because they are still potentially on the overflow list.
>>>> However, I think that case is not possible, because the thread doing
>>>> epoll_wait() on efd[0] is going to hold the ep->mtx, and thus when the
>>>> thread wakes up on efd[1], it's going to be ordered because it's also
>>>> grabbing the ep->mtx associated with efd[0].
>>>>
>>>> So I think it's safe to do the following if we want to go further than
>>>> the proposed patch, which is what you suggested earlier in the thread
>>>> (minus keeping the wakeup on ep->wq).
>>>
>>> Then I do not understand why we need to keep the ep->wq wakeup.
>>> @wq and @poll_wait are almost the same, with only one difference:
>>> @wq is used when you sleep on it inside epoll_wait(), and the other
>>> is used when you nest an epoll fd inside an epoll fd. Either you wake
>>> both up or you don't do this at all.
>>>
>>> ep_poll_callback() does the wakeup explicitly, ep_insert() and
>>> ep_modify() do the wakeup explicitly, so what are the cases when we
>>> need to do wakeups from ep_scan_ready_list()?
>>
>> Hi Roman,
>>
>> So the reason I was saying not to drop the ep->wq wakeup was that I was
>> thinking about a usecase where you have multi-threads say thread A and
>> thread B which are doing epoll_wait() on the same epfd. Now, the threads
>> both call epoll_wait() and are added as exclusive to ep->wq. Now a bunch
>> of events happen and thread A is woken up. However, thread A may only
>> process a subset of the events due to its 'maxevents' parameter. In that
>> case, I was thinking that the wakeup on ep->wq might be helpful, because
>> in the absence of subsequent events, thread B can now start processing
>> the rest, instead of waiting for the next event to be queued.
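>>
>> I.e. the usual pattern, hypothetically (both threads run this against
>> the same epfd; 8 stands in for a small 'maxevents'):
>>
>> #include <pthread.h>
>> #include <sys/epoll.h>
>>
>> static void *worker(void *arg)
>> {
>>         int epfd = *(int *)arg;
>>         struct epoll_event evs[8];
>>         int i, n;
>>
>>         for (;;) {
>>                 n = epoll_wait(epfd, evs, 8, -1);
>>                 if (n < 0)
>>                         break;
>>                 for (i = 0; i < n; i++)
>>                         ;       /* consume evs[i] */
>>         }
>>         return NULL;
>> }
>>
>> If, say, 20 events are ready, the thread that wakes up first drains
>> only 8 of them; the question is whether the other thread gets a wakeup
>> for the remainder.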
>>
>> However, I was thinking about the state of things before commit
>> 86c0517 ("fs/epoll: deal with wait_queue only once").
>>
>> Before that patch, thread A would have been removed from ep->wq before
>> the wakeup call, thus waking up thread B. However, now that thread A
>> stays on the queue during the call to ep_send_events(), I believe the
>> wakeup would only target thread A, which doesn't help since it's already
>> checking for events. So given the state of things I think you are right
>> in that it's not needed. However, I wonder if not removing from the
>> ep->wq affects the multi-threaded case I described. It's been around
>> since 5.0, so probably not, but it would be a more subtle performance
>> difference.
>
> Now I understand what you mean. You want to prevent "idling" of events
> while thread A is busy with a small portion of the events (the portion
> equal to maxevents). On the next iteration thread A will pick up the
> rest, no doubt, but it would be nice to give thread B a chance to deal
> with the rest immediately. Ok, makes sense.

Exactly, I don't believe it's racy as is - but it seems like it would be
good to wake up other threads that may be waiting. That said, this logic
was removed as I pointed out. So I'm not sure we need to tie this change
to the current one - but it may be a nice addition.

>
> But what about making this wakeup explicit when we have more events to
> process? (nothing is tested, just a guess)
>
> @@ -255,6 +255,7 @@ struct ep_pqueue {
>  struct ep_send_events_data {
>         int maxevents;
>         struct epoll_event __user *events;
> +       bool have_more;
>         int res;
>  };
> @@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head
>  }
>
>  static int ep_send_events(struct eventpoll *ep,
> -                          struct epoll_event __user *events, int maxevents)
> +                          struct epoll_event __user *events, int maxevents,
> +                          bool *have_more)
>  {
> -       struct ep_send_events_data esed;
> -
> -       esed.maxevents = maxevents;
> -       esed.events = events;
> +       struct ep_send_events_data esed = {
> +               .maxevents = maxevents,
> +               .events = events,
> +       };
>
>         ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
> +       *have_more = esed.have_more;
> +
>         return esed.res;
>  }
>
> @@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  {
>         int res = 0, eavail, timed_out = 0;
>         u64 slack = 0;
> -       bool waiter = false;
> +       bool waiter = false, have_more;
>         wait_queue_entry_t wait;
>         ktime_t expires, *to = NULL;
>
> @@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>          * more luck.
>          */
>         if (!res && eavail &&
> -           !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
> +           !(res = ep_send_events(ep, events, maxevents, &have_more)) &&
> +           !timed_out)
>                 goto fetch_events;
>
>         if (waiter) {
> @@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>                 __remove_wait_queue(&ep->wq, &wait);
>                 spin_unlock_irq(&ep->wq.lock);
>         }
> +       /*
> +        * We were not able to process all the events, so immediately
> +        * wake up other waiters.
> +        */
> +       if (res > 0 && have_more && waitqueue_active(&ep->wq))
> +               wake_up(&ep->wq);
>
>         return res;
>  }
>
>

Ok, yeah, I like making it explicit. Looks like you are missing the
changes to ep_scan_ready_list() (the ep_send_events_proc() hunk above is
empty), but I think the general approach makes sense. Although I really
didn't have a test case that motivated this - I was just sort of noting
this change in behavior while reviewing the current change.
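
Something like this at the tail of ep_send_events_proc() is presumably
what you had in mind (untested sketch on my side - events that don't fit
into ->maxevents stay on @head and get spliced back to the rdllist by
ep_scan_ready_list(), so a non-empty @head means there is more to do):

static __poll_t ep_send_events_proc(struct eventpoll *ep,
                                    struct list_head *head, void *priv)
{
        struct ep_send_events_data *esed = priv;

        /* ... existing transfer loop, bounded by esed->maxevents ... */

        /* Note whether we ran out of room before we ran out of events. */
        esed->have_more = !list_empty(head);

        return 0;
}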

> PS. So what do we decide about the original patch? Remove the whole branch?
>

FWIW, I'm ok with removing the whole branch as you proposed. And I think
the above change can go in separately (if we decide we want it). I don't
think they need to be tied together. I also want to make sure this
change gets a full linux-next cycle, so I think it should target 5.5 at
this point.
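
(For reference, by "the whole branch" I mean the tail of
ep_scan_ready_list() that re-arms the wakeups after the leftover events
are spliced back - roughly the block below, quoted from memory as a
sketch rather than verbatim:)

        if (!list_empty(&ep->rdllist)) {
                /*
                 * Wake up (if active) both the eventpoll wait list and
                 * the ->poll() wait list (delayed after we release the
                 * lock).
                 */
                if (waitqueue_active(&ep->wq))
                        wake_up(&ep->wq);
                if (waitqueue_active(&ep->poll_wait))
                        pwake++;
        }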

Thanks,

-Jason