Re: handling unsupported optlen in cgroup bpf getsockopt: (was [PATCH net-next v4 2/4] net: socket: add sockopts blacklist for BPF cgroup hook)

From: Kui-Feng Lee
Date: Tue Apr 25 2023 - 13:59:49 EST




On 4/18/23 09:47, Stanislav Fomichev wrote:
On 04/17, Martin KaFai Lau wrote:
On 4/14/23 6:55 PM, Stanislav Fomichev wrote:
On 04/13, Stanislav Fomichev wrote:
On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn
<aleksandr.mikhalitsyn@xxxxxxxxxxxxx> wrote:

On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:

On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn
<aleksandr.mikhalitsyn@xxxxxxxxxxxxx> wrote:

During work on SO_PEERPIDFD, it was discovered (thanks to Christian),
that bpf cgroup hook can cause FD leaks when used with sockopts which
install FDs into the process fdtable.

After some offlist discussion it was proposed to add a blacklist of

We try to replace this word by either denylist or blocklist, even in changelogs.

Hi Eric,

Oh, I'm sorry about that. :( Sure.


socket options those can cause troubles when BPF cgroup hook is enabled.


Can we find the appropriate Fixes: tag to help stable teams ?

Sure, I will add next time.

Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")

I think it's better to add Stanislav Fomichev to CC.

Can we use 'struct proto' bpf_bypass_getsockopt instead? We already
use it for tcp zerocopy, I'm assuming it should work in this case as
well?

Jakub reminded me of the other things I wanted to ask here bug forgot:

- setsockopt is probably not needed, right? setsockopt hook triggers
before the kernel and shouldn't leak anything
- for getsockopt, instead of bypassing bpf completely, should we instead
ignore the error from the bpf program? that would still preserve
the observability aspect

stealing this thread to discuss the optlen issue which may make sense to
bypass also.

There has been issue with optlen. Other than this older post related to
optlen > PAGE_SIZE:
https://lore.kernel.org/bpf/5c8b7d59-1f28-2284-f7b9-49d946f2e982@xxxxxxxxx/,
the recent one related to optlen that we have seen is
NETLINK_LIST_MEMBERSHIPS. The userspace passed in optlen == 0 and the kernel
put the expected optlen (> 0) and 'return 0;' to userspace. The userspace
intention is to learn the expected optlen. This makes 'ctx.optlen >
max_optlen' and __cgroup_bpf_run_filter_getsockopt() ends up returning
-EFAULT to the userspace even the bpf prog has not changed anything.

(ignoring -EFAULT issue) this seems like it needs to be

if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) {
/* error */
}

?

Does it make sense to also bypass the bpf prog when 'ctx.optlen >
max_optlen' for now (and this can use a separate patch which as usual
requires a bpf selftests)?

Yeah, makes sense. Replacing this -EFAULT with WARN_ON_ONCE or something
seems like the way to go. It caused too much trouble already :-(

Should I prepare a patch or do you want to take a stab at it?

In the future, does it make sense to have a specific cgroup-bpf-prog (a
specific attach type?) that only uses bpf_dynptr kfunc to access the optval
such that it can enforce read-only for some optname and potentially also
track if bpf-prog has written a new optval? The bpf-prog can only return 1
(OK) and only allows using bpf_set_retval() instead. Likely there is still
holes but could be a seed of thought to continue polishing the idea.

Ack, let's think about it.

Maybe we should re-evaluate 'getsockopt-happens-after-the-kernel' idea
as well? If we can have a sleepable hook that can copy_from_user/copy_to_user,
and we have a mostly working bpf_getsockopt (after your refactoring),
I don't see why we need to continue the current scheme of triggering
after the kernel?

Since a sleepable hook would cause some restrictions, perhaps, we could introduce something like the promise pattern. In our case here, BPF program call an async version of copy_from_user()/copy_to_user() to return a promise.


- or maybe we can even have a per-proto bpf_getsockopt_cleanup call that
gets called whenever bpf returns an error to make sure protocols have
a chance to handle that condition (and free the fd)