[PATCH net-next v2] netlink: introduce netlink poll to resolve fast return issue

From: Jong eon Park
Date: Tue Nov 14 2023 - 04:08:15 EST


In very rare cases, there was an issue where a user's 'poll' function
waiting for a uevent would continuously return very quickly, causing
excessive CPU usage due to the following scenario.

When sk_rmem_alloc exceeds sk_rcvbuf, netlink_broadcast_deliver returns an
error and netlink_overrun is called. However, if netlink_overrun was
called in a context just before a another context returns from the 'poll'
and 'recv' is invoked, emptying the rcv queue, sk->sk_err = ENOBUF is
written to the netlink socket belatedly and it enters the
NETLINK_S_CONGESTED state. If the user does not check for POLLERR, they
cannot consume and clean sk_err and repeatedly enter the situation where
they call 'poll' again but return immediately. Moreover, in this
situation, rcv queue is already empty and NETLINK_S_CONGESTED flag
prevents any more incoming packets. This makes it impossible for the user
to call 'recv'.

This "congested" situation is a bit ambiguous. The queue is empty, yet
'congested' remains. This means kernel can no longer deliver uevents
despite the empty queue, and it lead to the persistent 'congested' status.

------------CPU1 (kernel)---------- --------------CPU2 (app)--------------
...
a driver delivers uevent. poll was waiting for schedule.
a driver delivers uevent.
a driver delivers uevent.
...
1) netlink_broadcast_deliver fails.
(sk_rmem_alloc > sk_rcvbuf)
getting schedule and poll returns,
and the app calls recv.
(rcv queue is empied)
2)

netlink_overrun is called.
(NETLINK_S_CONGESTED flag is set,
ENOBUF is written in sk_err and,
wake up poll.)
finishing its job and call poll.
poll returns POLLERR.

(the app doesn't have POLLERR handler)
it calls poll, but getting POLLERR.
it calls poll, but getting POLLERR.
it calls poll, but getting POLLERR.
...

To address this issue, I would like to introduce the following netlink
poll.

After calling the datagram_poll, netlink poll checks the
NETLINK_S_CONGESTED status and rcv queue, and this make the user to be
readable once more even if the user has already emptied rcv queue. This
allows the user to be able to consume sk->sk_err value through
netlink_recvmsg, thus the situation described above can be avoided.

Co-developed-by: Dong ha Kang <dongha7.kang@xxxxxxxxxxx>
Signed-off-by: Jong eon Park <jongeon.park@xxxxxxxxxxx>

---
v2:
- Add more detailed explanation.
---
net/netlink/af_netlink.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index eb086b06d60d..f08c10220041 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2002,6 +2002,20 @@ static int netlink_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
return err ? : copied;
}

+static __poll_t netlink_poll(struct file *file, struct socket *sock,
+ poll_table *wait)
+{
+ __poll_t mask = datagram_poll(file, sock, wait);
+ struct sock *sk = sock->sk;
+ struct netlink_sock *nlk = nlk_sk(sk);
+
+ if (test_bit(NETLINK_S_CONGESTED, &nlk->state) &&
+ skb_queue_empty_lockless(&sk->sk_receive_queue))
+ mask |= EPOLLIN | EPOLLRDNORM;
+
+ return mask;
+}
+
static void netlink_data_ready(struct sock *sk)
{
BUG();
@@ -2803,7 +2817,7 @@ static const struct proto_ops netlink_ops = {
.socketpair = sock_no_socketpair,
.accept = sock_no_accept,
.getname = netlink_getname,
- .poll = datagram_poll,
+ .poll = netlink_poll,
.ioctl = netlink_ioctl,
.listen = sock_no_listen,
.shutdown = sock_no_shutdown,
--
2.25.1