Re: [PATCH v2 1/2] neigh: fix possible DoS due to net iface start/stop loop

From: Christian Brauner
Date: Mon Aug 15 2022 - 05:44:45 EST


On Wed, Aug 10, 2022 at 07:08:39PM +0300, Alexander Mikhalitsyn wrote:
> From: "Denis V. Lunev" <den@xxxxxxxxxx>
>
> Normal processing of ARP request (usually this is Ethernet broadcast
> packet) coming to the host is looking like the following:
> * the packet comes to arp_process() call and is passed through routing
> procedure
> * the request is put into the queue using pneigh_enqueue() if
> corresponding ARP record is not local (common case for container
> records on the host)
> * the request is processed by timer (within 80 jiffies by default) and
> ARP reply is sent from the same arp_process() using
> NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED condition (flag is set inside
> pneigh_enqueue())
>
> And here the problem comes. Linux kernel calls pneigh_queue_purge()
> which destroys the whole queue of ARP requests on ANY network interface
> start/stop event through __neigh_ifdown().
>
> This is actually not a problem within the original world as network
> interface start/stop was accessible to the host 'root' only, which
> could do more destructive things. But the world is changed and there
> are Linux containers available. Here container 'root' has an access
> to this API and could be considered as untrusted user in the hosting
> (container's) world.
>
> Thus there is an attack vector to other containers on node when
> container's root will endlessly start/stop interfaces. We have observed
> similar situation on a real production node when docker container was
> doing such activity and thus other containers on the node become not
> accessible.
>
> The patch proposed doing very simple thing. It drops only packets from
> the same namespace in the pneigh_queue_purge() where network interface
> state change is detected. This is enough to prevent the problem for the
> whole node preserving original semantics of the code.

This is how I'd do it as well.

>
> v2:
> - do del_timer_sync() if queue is empty after pneigh_queue_purge()
>
> Cc: "David S. Miller" <davem@xxxxxxxxxxxxx>
> Cc: Eric Dumazet <edumazet@xxxxxxxxxx>
> Cc: Jakub Kicinski <kuba@xxxxxxxxxx>
> Cc: Paolo Abeni <pabeni@xxxxxxxxxx>
> Cc: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> Cc: David Ahern <dsahern@xxxxxxxxxx>
> Cc: Yajun Deng <yajun.deng@xxxxxxxxx>
> Cc: Roopa Prabhu <roopa@xxxxxxxxxx>
> Cc: Christian Brauner <brauner@xxxxxxxxxx>
> Cc: netdev@xxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Cc: Alexey Kuznetsov <kuznet@xxxxxxxxxxxxx>
> Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@xxxxxxxxxxxxx>
> Cc: Konstantin Khorenko <khorenko@xxxxxxxxxxxxx>
> Cc: kernel@xxxxxxxxxx
> Cc: devel@xxxxxxxxxx
> Investigated-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@xxxxxxxxxxxxx>
> Signed-off-by: Denis V. Lunev <den@xxxxxxxxxx>
> ---
> net/core/neighbour.c | 25 +++++++++++++++++--------
> 1 file changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 54625287ee5b..19d99d1eff53 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -307,14 +307,23 @@ static int neigh_del_timer(struct neighbour *n)
> return 0;
> }
>
> -static void pneigh_queue_purge(struct sk_buff_head *list)
> +static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
> {
> + unsigned long flags;
> struct sk_buff *skb;
>
> - while ((skb = skb_dequeue(list)) != NULL) {
> - dev_put(skb->dev);
> - kfree_skb(skb);
> + spin_lock_irqsave(&list->lock, flags);

I'm a bit surprised to see a spinlock held around a while loop walking a
linked list but that seems to be quite common in this file. I take it
the lists are guaranteed to be short.

> + skb = skb_peek(list);
> + while (skb != NULL) {
> + struct sk_buff *skb_next = skb_peek_next(skb, list);
> + if (net == NULL || net_eq(dev_net(skb->dev), net)) {
> + __skb_unlink(skb, list);
> + dev_put(skb->dev);
> + kfree_skb(skb);
> + }
> + skb = skb_next;
> }
> + spin_unlock_irqrestore(&list->lock, flags);
> }
>
> static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev,
> @@ -385,9 +394,9 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,
> write_lock_bh(&tbl->lock);
> neigh_flush_dev(tbl, dev, skip_perm);
> pneigh_ifdown_and_unlock(tbl, dev);
> -
> - del_timer_sync(&tbl->proxy_timer);
> - pneigh_queue_purge(&tbl->proxy_queue);
> + pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
> + if (skb_queue_empty_lockless(&tbl->proxy_queue))
> + del_timer_sync(&tbl->proxy_timer);
> return 0;
> }
>
> @@ -1787,7 +1796,7 @@ int neigh_table_clear(int index, struct neigh_table *tbl)
> cancel_delayed_work_sync(&tbl->managed_work);
> cancel_delayed_work_sync(&tbl->gc_work);
> del_timer_sync(&tbl->proxy_timer);
> - pneigh_queue_purge(&tbl->proxy_queue);
> + pneigh_queue_purge(&tbl->proxy_queue, NULL);
> neigh_ifdown(tbl, NULL);
> if (atomic_read(&tbl->entries))
> pr_crit("neighbour leakage\n");
> --
> 2.36.1
>