Re: [RFC PATCH net-next] tcp: add a tracepoint for tcp_listen_queue_drop

From: Ivan Babrou
Date: Fri Jul 14 2023 - 19:38:57 EST


On Fri, Jul 14, 2023 at 8:09 AM David Ahern <dsahern@xxxxxxxxxx> wrote:
> > We can start a separate discussion to break it down by category if it
> > would help. Let me know what kind of information you would like us to
> > provide to help with that. I assume you're interested in kernel stacks
> > leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's
> > something else.
>
> stack traces would be helpful.

Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@xxxxxxxxxxxxxx/

> > Even if I was only interested in one specific reason, I would still
> > have to arm the whole tracepoint and route a ton of skbs I'm not
> > interested in into my bpf code. This seems like a lot of overhead,
> > especially if I'm dropping some attack packets.
>
> you can add a filter on the tracepoint event to limit what is passed
> (although I have not tried the filter with an ebpf program - e.g.,
> reason != NOT_SPECIFIED).

Absolutely, but isn't there still overhead in evaluating even that filter
for every freed skb?

> > If you have an ebpf example that would help me extract the destination
> > port from an skb in kfree_skb, I'd be interested in taking a look and
> > trying to make it work.
>
> This is from 2020 and I forget which kernel version (pre-BTF), but it
> worked at that time and allowed userspace to summarize drop reasons by
> various network data (mac, L3 address, n-tuple, etc):
>
> https://github.com/dsahern/bpf-progs/blob/master/ksrc/pktdrop.c

It doesn't seem to extract the L4 metadata (local port specifically),
which is what I'm after.

> > The need to extract the protocol level information in ebpf is only
> > making kfree_skb more expensive for the needs of catching rare cases
> > when we run out of buffer space (UDP) or listen queue (TCP). These two
> > cases are very common failure scenarios that people are interested in
> > catching with straightforward tracepoints that can give them the
> > needed information easily and cheaply.
> >
> > I sympathize with the desire to keep the number of tracepoints in
> > check, but I also feel like UDP buffer drops and TCP listen drops
> > tracepoints are very much justified to exist.
>
> sure, kfree_skb is like the raw_syscall tracepoint - it can be more than
> what you need for a specific problem, but it also gives you way more
> than you are thinking about today.

I really like the comparison to the raw_syscall tracepoint. There are two flavors:

1. Catch-all: raw_syscalls:sys_enter, which is similar to skb:kfree_skb.
2. Specific tracepoints: syscalls:sys_enter_* for every syscall.

If you are interested in one rare syscall, you wouldn't attach to the
catch-all firehose and then filter for the syscall id in post-processing.
Understandably, we probably can't have a separate skb:kfree_skb for every
reason. However, some reasons are more useful than others, and I believe
that tcp listen drops fall into that category.

We went through a similar exercise with the audit subsystem, which in fact
arms all syscalls even if you only audit one of them:

* https://lore.kernel.org/audit/20230523181624.19932-1-ivan@xxxxxxxxxxxxxx/T/#u

With pictures, if you're interested:

* https://mastodon.ivan.computer/@mastodon/110426498281603668

Nobody audits futex(), but if you audit execve(), all the rules run
for both. Some rules will run faster, but all of them will run. That's a
lot of overhead across millions of CPUs, which I'm trying to avoid (the
planet is hot as it is).

Ultimately my arguments for a separate tracepoint for tcp listen drops
are at the bottom of my reply to Jakub:

* https://lore.kernel.org/netdev/CABWYdi2BGi=iRCfLhmQCqO=1eaQ1WaCG7F9WsJrz-7==ocZidg@xxxxxxxxxxxxxx/