Weird issue with epoll and kernel >= 5.0

From: Omar Kilani
Date: Sat Mar 28 2020 - 14:12:28 EST


Hi there,

I've observed an issue with epoll and kernels 5.0 and above when a
system is generating a lot of epoll events.

I see this issue with nginx and JVM / Netty based apps (using the
JVM's native epoll support as well as Netty's own optimized epoll
support) but *not* with HAProxy (?).

I'm not really sure what the actual problem is (nginx complains about
epoll_wait with a generic error), but it doesn't happen on 4.19.x and
lower.
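
For anyone trying to narrow down what the "generic error" actually is,
a minimal sketch of capturing the exact errno from epoll_wait could look
like the following. This is only an illustration, not a reproducer; the
helper name wait_and_report is hypothetical and is not taken from nginx
or netty.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>

/* Hypothetical helper: wait on an epoll fd and, on failure, print the
 * exact errno instead of a generic message. Retries only on EINTR. */
int wait_and_report(int epfd, struct epoll_event *events, int max_events,
                    int timeout_ms)
{
    for (;;) {
        int n = epoll_wait(epfd, events, max_events, timeout_ms);
        if (n >= 0)
            return n;           /* ready descriptors, or 0 on timeout */
        if (errno == EINTR)
            continue;           /* interrupted by a signal: retry */

        int saved = errno;      /* fprintf may overwrite errno */
        fprintf(stderr, "epoll_wait(%d) failed: errno=%d (%s)\n",
                epfd, saved, strerror(saved));
        errno = saved;          /* leave errno intact for the caller */
        return -1;
    }
}
```

Calling this in place of a bare epoll_wait in a test program would show
whether the failure is, for example, EBADF, EFAULT or EINVAL, which
should make it easier to compare 4.19.x against 5.x.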

I thought it was a netty problem at first and opened this ticket:

https://github.com/netty/netty/issues/8999

But then saw the same issue in nginx.

I haven't debugged a kernel issue in something like 20 years so I'm
not really sure where to start myself.

I'd be more than happy to provide my test case that has a very quick
repro to anyone who needs it.

I'm also happy to provide a VM/machine with enough CPUs to trigger it
easily (it seems to happen more quickly with more CPUs present) to test
with.

Thanks!

Regards,
Omar