Re: [PATCH v2] eventpoll: fix missing wakeup for ovflist in ep_poll_callback

From: Jason Baron
Date: Wed Apr 29 2020 - 00:12:15 EST




On 4/28/20 2:10 PM, Roman Penyaev wrote:
> On 2020-04-27 22:38, Jason Baron wrote:
>> On 4/25/20 4:59 PM, Khazhismel Kumykov wrote:
>>> On Sat, Apr 25, 2020 at 9:17 AM Jason Baron <jbaron@xxxxxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>> On 4/24/20 3:00 PM, Khazhismel Kumykov wrote:
>>>>> In the event that we add to ovflist, before 339ddb53d373 we would be
>>>>> woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
>>>>> With that wakeup removed, if we add to ovflist here, we may never wake
>>>>> up. Rather than adding back the ep_scan_ready_list wakeup - which was
>>>>> resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
>>>>
>>>> I'm just curious which 'wakeup' we are talking about here? There is:
>>>> wake_up(&ep->wq) for the 'ep' and then there is the nested one via:
>>>> ep_poll_safewake(ep, epi). It seems to me that it's only the latter
>>>> one that is missing, not both? Is your workload using nested epoll?
>>>>
>>>> If so, it might make sense to just do the latter, since the point of
>>>> the original patch was to minimize unnecessary wakeups.
>>>
>>> The missing wake-ups were when we added to ovflist instead of rdllist.
>>> Both are "the ready list" together - so I'd think we'd want the same
>>> wakeups regardless of which specific list we added to.
>>> ep_poll_callback isn't nested specific?
>>>
>>
>> So I was thinking that ep_poll() would see these events on the
>> ovflist without an explicit wakeup, b/c the overflow list being active
>> implies that the ep_poll() path would add them to the rdllist in
>> ep_scan_ready_list(). Thus, it will see the events either in the
>> current ep_poll() context or via a subsequent syscall to epoll_wait().
>>
>> However, there are other paths that can call ep_scan_ready_list() thus
>> I agree with you that both wakeups here are necessary.
>>
>> I do think there are still (smaller) potential races in ep_scan_ready_list(),
>> where we have:
>>
>>         write_lock_irq(&ep->lock);
>>         list_splice_init(&ep->rdllist, &txlist);
>>         WRITE_ONCE(ep->ovflist, NULL);
>>         write_unlock_irq(&ep->lock);
>>
>> And in the ep_poll path we have:
>>
>> static inline int ep_events_available(struct eventpoll *ep)
>> {
>>         return !list_empty_careful(&ep->rdllist) ||
>>                READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
>> }
>>
>>
>> Seems to me that first bit needs to be the following, since
>> ep_events_available() is now checked in a lockless way:
>>
>>
>>         write_lock_irq(&ep->lock);
>>         WRITE_ONCE(ep->ovflist, NULL);
>>         smp_wmb();
>>         list_splice_init(&ep->rdllist, &txlist);
>>         write_unlock_irq(&ep->lock);
>
>
> Hi Jason,
>
> For the first chunk you refer to, the order seems irrelevant.
> Either you see something not empty, you go take the lock
> and then check the lists under the lock, or you see that all
> the lists are empty.
>

Hi Roman,

It does matter. Let's say we have:

epfd1->epfd2->socket. Thread a is doing an
epoll_wait() on epfd1, and thread b is doing an
epoll_wait() on epfd2. Then:

1) data comes in on socket

ep_poll_callback() wakes up threads a and b

2) thread a runs

ep_poll()
  ep_scan_ready_list()
    ep_send_events_proc()
      ep_item_poll()
        ep_scan_ready_list()
          list_splice_init(&ep->rdllist, &txlist);

3) now thread b is running

ep_poll()
  ep_events_available()
    returns false
  schedule_hrtimeout_range()

Thus, thread b has missed a wakeup and will never
get it.


Similarly, for the second chunk I referenced.

>>
>> And also this bit:
>>
>>         WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>>         /*
>>          * Quickly re-inject items left on "txlist".
>>          */
>>         list_splice(&txlist, &ep->rdllist);
>>
>> should, I think, also be reversed to:
>>
>> list_splice(&txlist, &ep->rdllist);
>> smp_wmb();
>> WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>
> But this one is much more interesting. I understand what you
> are trying to achieve: we can't leave both lists empty even for a
> short period of time. If there is something left, the caller
> of ep_events_available() should be able to see that one of the
> lists is not empty, otherwise it can be too late.
>
> But the problem you've spotted is much worse. Some events can
> remain on the txlist (this can happen if the caller of epoll_wait
> wants to harvest only 1 event, but there are more in the ->rdllist).
> And again we get a lost wakeup.
>
> The problem is reproduced by the test below. The idea is simple:
> we have 10 threads and 10 event fds. Each thread can harvest
> only 1 event. 1 producer fires all 10 events at once and waits
> for all 10 events to be observed by the 10 threads.
>
> The fix is basically a revert of 339ddb53d373 with one major
> exception: we do wakeups from ep_scan_ready_list(), but only
> if the txlist is not empty and ep_scan_ready_list() was called
> from the routine which sends events, not the one which reads them
> (thus we protect ourselves from repeated wakeups).
>
> I will send the code a bit later.

This was discussed as part of the original discussion around
339ddb53d373: https://lkml.org/lkml/2019/10/7/905

The context was more about a performance difference than
a semantic one, but as I pointed out I believe that
behavior pre-dates the above commit and goes back to:
86c0517 fs/epoll: deal with wait_queue only once

There, since the thread is left on the waitqueue across the
ep_scan_ready_list() call, the ep_wakeup() (that was removed
in 339ddb53d373) would no longer wake up other potential
waiters.

So since this behavior change goes back to 5.0 and there
really haven't been any reports, I don't think there are
many apps relying on the semantics that your test
case is showing. It would be interesting to confirm that
your test does indeed succeed/fail before/after that patch.

Also, as part of that original discussion, you had a patch
that I think addresses this. I would be ok with that, in
addition to a patch to address the ordering issue I pointed
out. I can post a patch for the former, if you think this
plan makes sense?

Thanks,

-Jason


>
> --
> Roman
>
> ---- test -------
>
> enum {
>     EPOLL60_EVENTS_NR = 10,
> };
>
> struct epoll60_ctx {
>     volatile int stopped;
>     int ready;
>     int waiters;
>     int epfd;
>     int evfd[EPOLL60_EVENTS_NR];
> };
>
> static inline int count_waiters(struct epoll60_ctx *ctx)
> {
>     return __atomic_load_n(&ctx->waiters, __ATOMIC_ACQUIRE);
> }
>
> static void *epoll60_wait_thread(void *ctx_)
> {
>     struct epoll60_ctx *ctx = ctx_;
>     struct epoll_event e;
>     uint64_t v;
>     int ret;
>
>     while (!ctx->stopped) {
>         /* Mark we are ready */
>         __atomic_fetch_add(&ctx->ready, 1, __ATOMIC_ACQUIRE);
>
>         /* Start when all are ready */
>         while (__atomic_load_n(&ctx->ready, __ATOMIC_ACQUIRE) &&
>                !ctx->stopped)
>             ;
>
>         /* Account this waiter */
>         __atomic_fetch_add(&ctx->waiters, 1, __ATOMIC_ACQUIRE);
> again_wait:
>         ret = epoll_wait(ctx->epfd, &e, 1, 1000);
>         if (ret != 1) {
>             /* Should be stopped, otherwise we lost a wakeup */
>             assert(ctx->stopped);
>             return NULL;
>         }
>
>         ret = read(e.data.fd, &v, sizeof(v));
>         if (ret != sizeof(v)) {
>             /* Event was stolen by another thread */
>             goto again_wait;
>         }
>         __atomic_fetch_sub(&ctx->waiters, 1, __ATOMIC_RELEASE);
>     }
>
>     return NULL;
> }
>
> static inline unsigned long long msecs(void)
> {
>     struct timespec ts;
>     unsigned long long msecs;
>
>     clock_gettime(CLOCK_REALTIME, &ts);
>     msecs = ts.tv_sec * 1000ull;
>     msecs += ts.tv_nsec / 1000000ull;
>
>     return msecs;
> }
>
> TEST(epoll60)
> {
>     struct epoll60_ctx ctx = { 0 };
>     pthread_t waiters[ARRAY_SIZE(ctx.evfd)];
>     struct epoll_event e;
>     int i, n, ret;
>
>     signal(SIGUSR1, signal_handler);
>
>     ctx.epfd = epoll_create1(0);
>     ASSERT_GE(ctx.epfd, 0);
>
>     /* Create event fds */
>     for (i = 0; i < ARRAY_SIZE(ctx.evfd); i++) {
>         ctx.evfd[i] = eventfd(0, EFD_NONBLOCK);
>         ASSERT_GE(ctx.evfd[i], 0);
>
>         e.events = EPOLLIN | EPOLLERR;
>         e.data.fd = ctx.evfd[i];
>         ASSERT_EQ(epoll_ctl(ctx.epfd, EPOLL_CTL_ADD, ctx.evfd[i], &e), 0);
>     }
>
>     /* Create waiter threads */
>     for (i = 0; i < ARRAY_SIZE(waiters); i++)
>         ASSERT_EQ(pthread_create(&waiters[i], NULL,
>                                  epoll60_wait_thread, &ctx), 0);
>
>     for (i = 0; i < 300; i++) {
>         uint64_t v = 1, ms;
>
>         /* Wait for all to be ready */
>         while (__atomic_load_n(&ctx.ready, __ATOMIC_ACQUIRE) !=
>                ARRAY_SIZE(ctx.evfd))
>             ;
>
>         /* Steady, go */
>         __atomic_fetch_sub(&ctx.ready, ARRAY_SIZE(ctx.evfd),
>                            __ATOMIC_ACQUIRE);
>
>         /* Wait until all have gone to the kernel */
>         while (count_waiters(&ctx) != ARRAY_SIZE(ctx.evfd))
>             ;
>
>         /* 1ms should be enough to schedule out */
>         usleep(1000);
>
>         /* Quickly signal all handles at once */
>         for (n = 0; n < ARRAY_SIZE(ctx.evfd); n++) {
>             ret = write(ctx.evfd[n], &v, sizeof(v));
>             ASSERT_EQ(ret, sizeof(v));
>         }
>
>         /* Busy loop for up to 3s and wait for all waiters to wake up */
>         ms = msecs();
>         while (count_waiters(&ctx) && msecs() < ms + 3000)
>             ;
>
>         ASSERT_EQ(count_waiters(&ctx), 0);
>     }
>     ctx.stopped = 1;
>     /* Stop waiters */
>     for (i = 0; i < ARRAY_SIZE(waiters); i++) {
>         pthread_kill(waiters[i], SIGUSR1);
>         pthread_join(waiters[i], NULL);
>     }
>
>     for (i = 0; i < ARRAY_SIZE(waiters); i++)
>         close(ctx.evfd[i]);
>     close(ctx.epfd);
> }
>
>