Re: [RFC bpf-next] xsk: honor SO_BINDTODEVICE on bind

From: Ilya Maximets
Date: Mon Jul 03 2023 - 06:13:55 EST


On 7/3/23 12:06, Ilya Maximets wrote:
> On 7/3/23 11:48, Magnus Karlsson wrote:
>> On Fri, 30 Jun 2023 at 16:58, Ilya Maximets <i.maximets@xxxxxxx> wrote:
>>>
>>> Initial creation of an AF_XDP socket requires CAP_NET_RAW capability.
>>> A privileged process might create the socket and pass it to a
>>> non-privileged process for later use. However, that process will be
>>> able to bind the socket to any network interface. Even though it will
>>> not be able to receive any traffic without modification of the BPF map,
>>> the situation is not ideal.
>>>
>>> Sockets already have a mechanism that can be used to restrict what
>>> interface they can be attached to. That is SO_BINDTODEVICE.
>>>
>>> To change the binding the process will need CAP_NET_RAW.
>>>
>>> Make xsk_bind() honor the SO_BINDTODEVICE in order to allow safer
>>> workflow when non-privileged process is using AF_XDP.
>>
>> Rebinding an AF_XDP socket is not allowed today. Any such attempt will
>> return an error from bind. So if I understand the purpose of
>> SO_BINDTODEVICE correctly, you could say that this option is always
>> set for an AF_XDP socket and it is not possible to toggle it. The only
>> way to "rebind" an AF_XDP socket is to close it and open a new one.
>> This was a conscious design decision from day one as it would be very
>> hard to support this, especially in zero-copy mode.
>
> Hi, Magnus.
>
> The purpose of this patch is not to allow re-binding. The use case is
> following:
>
> 1. First process creates a bare socket with socket(AF_XDP, ...).
> 2. First process loads the XSK program to the interface.
> 3. First process adds the socket fd to a BPF map.
> 4. First process sends socket fd to a second process.
> 5. Second process allocates UMEM.
> 6. Second process binds socket to the interface.

7. Second process sends/receives the traffic. :)

>
> The idea is that the first process will call SO_BINDTODEVICE before
> sending socket fd to a second process, so the second process is limited
> in to which interface it can bind the socket.
>
> Does that make sense?
>
> This workflow allows the second process to have no capabilities
> as long as it has sufficient RLIMIT_MEMLOCK.

Note that steps 1-7 are working just fine today. i.e. the umem
registration, bind, ring mapping and traffic send/receive do not
require any extra capabilities.

We may restrict the bind() call to require CAP_NET_RAW and then
allow it for sockets that had SO_BINDTODEVICE as an alternative.
But restriction will break the current uAPI.

>
> Best regards, Ilya Maximets.
>
>>
>>> Signed-off-by: Ilya Maximets <i.maximets@xxxxxxx>
>>> ---
>>>
>>> Posting as an RFC for now to probably get some feedback.
>>> Will re-post once the tree is open.
>>>
>>> Documentation/networking/af_xdp.rst | 9 +++++++++
>>> net/xdp/xsk.c | 6 ++++++
>>> 2 files changed, 15 insertions(+)
>>>
>>> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
>>> index 247c6c4127e9..1cc35de336a4 100644
>>> --- a/Documentation/networking/af_xdp.rst
>>> +++ b/Documentation/networking/af_xdp.rst
>>> @@ -433,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the
>>> application to use. The final option is the flags field, but it will
>>> be dealt with in separate sections for each UMEM flag.
>>>
>>> +SO_BINDTODEVICE setsockopt
>>> +--------------------------
>>> +
>>> +This is a generic SOL_SOCKET option that can be used to tie AF_XDP
>>> +socket to a particular network interface. It is useful when a socket
>>> +is created by a privileged process and passed to a non-privileged one.
>>> +Once the option is set, kernel will refuse attempts to bind that socket
>>> +to a different interface. Updating the value requires CAP_NET_RAW.
>>> +
>>> XDP_STATISTICS getsockopt
>>> -------------------------
>>>
>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>> index 5a8c0dd250af..386ff641db0f 100644
>>> --- a/net/xdp/xsk.c
>>> +++ b/net/xdp/xsk.c
>>> @@ -886,6 +886,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>>> struct sock *sk = sock->sk;
>>> struct xdp_sock *xs = xdp_sk(sk);
>>> struct net_device *dev;
>>> + int bound_dev_if;
>>> u32 flags, qid;
>>> int err = 0;
>>>
>>> @@ -899,6 +900,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>>> XDP_USE_NEED_WAKEUP))
>>> return -EINVAL;
>>>
>>> + bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
>>> +
>>> + if (bound_dev_if && bound_dev_if != sxdp->sxdp_ifindex)
>>> + return -EINVAL;
>>> +
>>> rtnl_lock();
>>> mutex_lock(&xs->mutex);
>>> if (xs->state != XSK_READY) {
>>> --
>>> 2.40.1
>>>
>>>
>