Re: [RFC bpf-next] xsk: honor SO_BINDTODEVICE on bind

From: Magnus Karlsson
Date: Mon Jul 03 2023 - 06:24:32 EST


On Mon, 3 Jul 2023 at 12:13, Ilya Maximets <i.maximets@xxxxxxx> wrote:
>
> On 7/3/23 12:06, Ilya Maximets wrote:
> > On 7/3/23 11:48, Magnus Karlsson wrote:
> >> On Fri, 30 Jun 2023 at 16:58, Ilya Maximets <i.maximets@xxxxxxx> wrote:
> >>>
> >>> Initial creation of an AF_XDP socket requires CAP_NET_RAW capability.
> >>> A privileged process might create the socket and pass it to a
> >>> non-privileged process for later use. However, that process will be
> >>> able to bind the socket to any network interface. Even though it will
> >>> not be able to receive any traffic without modification of the BPF map,
> >>> the situation is not ideal.
> >>>
> >>> Sockets already have a mechanism that can be used to restrict what
> >>> interface they can be attached to. That is SO_BINDTODEVICE.
> >>>
> >>> To change the binding the process will need CAP_NET_RAW.
> >>>
> >>> Make xsk_bind() honor SO_BINDTODEVICE in order to allow a safer
> >>> workflow when a non-privileged process is using AF_XDP.
> >>
> >> Rebinding an AF_XDP socket is not allowed today. Any such attempt will
> >> return an error from bind. So if I understand the purpose of
> >> SO_BINDTODEVICE correctly, you could say that this option is always
> >> set for an AF_XDP socket and it is not possible to toggle it. The only
> >> way to "rebind" an AF_XDP socket is to close it and open a new one.
> >> This was a conscious design decision from day one as it would be very
> >> hard to support this, especially in zero-copy mode.
> >
> > Hi, Magnus.
> >
> > The purpose of this patch is not to allow re-binding. The use case is
> > the following:
> >
> > 1. First process creates a bare socket with socket(AF_XDP, ...).
> > 2. First process loads the XSK program to the interface.
> > 3. First process adds the socket fd to a BPF map.
> > 4. First process sends socket fd to a second process.
> > 5. Second process allocates UMEM.
> > 6. Second process binds socket to the interface.
>
> 7. Second process sends/receives the traffic. :)
>
> >
> > The idea is that the first process will set SO_BINDTODEVICE before
> > sending the socket fd to the second process, so the second process is
> > limited in which interface it can bind the socket to.
> >
> > Does that make sense?

Thanks for explaining this to me. Yes, that makes sense and seems
useful. Could you please send a v2 and include the flow (1-7) above in
your commit message? It would be good to add a step with the setsockopt
SO_BINDTODEVICE call before step #4, just to be clear. With those
changes, please feel free to include my ack:

Acked-by: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>

Thank you!

> > This workflow allows the second process to have no capabilities
> > as long as it has sufficient RLIMIT_MEMLOCK.
>
> Note that steps 1-7 are working just fine today, i.e. the UMEM
> registration, bind, ring mapping and traffic send/receive do not
> require any extra capabilities.
>
> We could restrict the bind() call to require CAP_NET_RAW and then
> allow it for sockets that have SO_BINDTODEVICE set as an alternative.
> But such a restriction would break the current uAPI.
>
> >
> > Best regards, Ilya Maximets.
> >
> >>
> >>> Signed-off-by: Ilya Maximets <i.maximets@xxxxxxx>
> >>> ---
> >>>
> >>> Posting as an RFC for now to hopefully get some feedback.
> >>> Will re-post once the tree is open.
> >>>
> >>> Documentation/networking/af_xdp.rst | 9 +++++++++
> >>> net/xdp/xsk.c | 6 ++++++
> >>> 2 files changed, 15 insertions(+)
> >>>
> >>> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> >>> index 247c6c4127e9..1cc35de336a4 100644
> >>> --- a/Documentation/networking/af_xdp.rst
> >>> +++ b/Documentation/networking/af_xdp.rst
> >>> @@ -433,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the
> >>> application to use. The final option is the flags field, but it will
> >>> be dealt with in separate sections for each UMEM flag.
> >>>
> >>> +SO_BINDTODEVICE setsockopt
> >>> +--------------------------
> >>> +
> >>> +This is a generic SOL_SOCKET option that can be used to tie AF_XDP
> >>> +socket to a particular network interface. It is useful when a socket
> >>> +is created by a privileged process and passed to a non-privileged one.
> >>> +Once the option is set, kernel will refuse attempts to bind that socket
> >>> +to a different interface. Updating the value requires CAP_NET_RAW.
> >>> +
> >>> XDP_STATISTICS getsockopt
> >>> -------------------------
> >>>
> >>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> >>> index 5a8c0dd250af..386ff641db0f 100644
> >>> --- a/net/xdp/xsk.c
> >>> +++ b/net/xdp/xsk.c
> >>> @@ -886,6 +886,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
> >>> struct sock *sk = sock->sk;
> >>> struct xdp_sock *xs = xdp_sk(sk);
> >>> struct net_device *dev;
> >>> + int bound_dev_if;
> >>> u32 flags, qid;
> >>> int err = 0;
> >>>
> >>> @@ -899,6 +900,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
> >>> XDP_USE_NEED_WAKEUP))
> >>> return -EINVAL;
> >>>
> >>> + bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
> >>> +
> >>> + if (bound_dev_if && bound_dev_if != sxdp->sxdp_ifindex)
> >>> + return -EINVAL;
> >>> +
> >>> rtnl_lock();
> >>> mutex_lock(&xs->mutex);
> >>> if (xs->state != XSK_READY) {
> >>> --
> >>> 2.40.1
> >>>
> >>>
> >
>
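
As a rough sketch of what the second process does in step 6, assuming
the UMEM and rings from step 5 are already set up on the fd. The helper
names xsk_fill_addr() and xsk_bind_to() are made up for this example,
and the bind flags are left at zero for simplicity.

```c
#include <errno.h>
#include <linux/if_xdp.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>

#ifndef AF_XDP
#define AF_XDP 44	/* fallback for older libc headers */
#endif

/*
 * Hypothetical helper: fill 'sxdp' for the given interface and queue.
 * Returns 0, or -1 if the interface name cannot be resolved (errno is
 * left as set by if_nametoindex()).
 */
static int xsk_fill_addr(struct sockaddr_xdp *sxdp, const char *ifname,
			 unsigned int queue_id)
{
	unsigned int ifindex = if_nametoindex(ifname);

	if (!ifindex)
		return -1;

	memset(sxdp, 0, sizeof(*sxdp));
	sxdp->sxdp_family = AF_XDP;
	sxdp->sxdp_ifindex = ifindex;
	sxdp->sxdp_queue_id = queue_id;
	return 0;
}

/*
 * Hypothetical helper: bind the received socket fd (step 6).
 * Returns 0 on success, -1 with errno set on failure.
 */
static int xsk_bind_to(int fd, const char *ifname, unsigned int queue_id)
{
	struct sockaddr_xdp sxdp;

	if (xsk_fill_addr(&sxdp, ifname, queue_id) < 0)
		return -1;

	return bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
}
```

With the check from this patch applied, a bind() to an interface other
than the one set via SO_BINDTODEVICE would fail with EINVAL. The same
error is already returned for invalid flags, so user space cannot tell
the two cases apart from errno alone.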