Re: [RFC PATCH net-next] tcp: add a tracepoint for tcp_listen_queue_drop

From: David Ahern
Date: Fri Jul 14 2023 - 21:31:18 EST


On 7/14/23 5:38 PM, Ivan Babrou wrote:
> On Fri, Jul 14, 2023 at 8:09 AM David Ahern <dsahern@xxxxxxxxxx> wrote:
>>> We can start a separate discussion to break it down by category if it
>>> would help. Let me know what kind of information you would like us to
>>> provide to help with that. I assume you're interested in kernel stacks
>>> leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's
>>> something else.
>>
>> stack traces would be helpful.
>
> Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@xxxxxxxxxxxxxx/
>
>>> Even if I was only interested in one specific reason, I would still
>>> have to arm the whole tracepoint and route a ton of skbs I'm not
>>> interested in into my bpf code. This seems like a lot of overhead,
>>> especially if I'm dropping some attack packets.
>>
>> you can add a filter on the tracepoint event to limit what is passed -
>> e.g., reason != NOT_SPECIFIED (although I have not tried combining a
>> filter with an ebpf program).
>
> Absolutely, but isn't there overhead to even do just that for every freed skb?

There is some amount of overhead. If filters can be used with ebpf
programs, then the differential cost is just the cycles for the filter,
which in this case is an integer compare. It should be low - maybe
Steven has some data on the overhead?
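
For what it's worth, a minimal sketch of that per-skb check done at the
top of an ebpf program attached to skb:kfree_skb (the context struct
name and fields are my reading of the tracepoint format, so treat them
as assumptions; whether the compare lives in a trace-event filter
string or in the program itself, the per-freed-skb work is roughly this
one integer compare):

  /* sketch only -- assumes vmlinux.h generated from the running
   * kernel's BTF */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  SEC("tracepoint/skb/kfree_skb")
  int handle_kfree_skb(struct trace_event_raw_kfree_skb *ctx)
  {
          /* the "integer compare": bail out early for the reason we
           * do not care about, before doing any real work */
          if (ctx->reason == SKB_DROP_REASON_NOT_SPECIFIED)
                  return 0;

          /* processing for the reasons of interest goes here */
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";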

>
>>> If you have an ebpf example that would help me extract the destination
>>> port from an skb in kfree_skb, I'd be interested in taking a look and
>>> trying to make it work.
>>
>> This is from 2020 and I forget which kernel version (pre-BTF), but it
>> worked at that time and allowed userspace to summarize drop reasons by
>> various network data (mac, L3 address, n-tuple, etc):
>>
>> https://github.com/dsahern/bpf-progs/blob/master/ksrc/pktdrop.c
>
> It doesn't seem to extract the L4 metadata (local port specifically),
> which is what I'm after.

This program takes the path of copying headers to userspace and doing
the parsing there (there is a netmon program that uses that ebpf
program and shows drops from various perspectives). You can just as
easily do the parsing in ebpf. Once you have the start of packet data,
walk the protocols of interest -- e.g., see parse_pkt in flow.h.
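
As a rough, untested sketch of the in-ebpf route (not the pktdrop.c
code; the CO-RE reads, the struct/field names, and the assumption that
the network/transport header offsets are already set are all mine),
pulling the TCP destination port out of the dropped skb could look
something like:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_core_read.h>
  #include <bpf/bpf_endian.h>

  SEC("tracepoint/skb/kfree_skb")
  int drop_dport(struct trace_event_raw_kfree_skb *ctx)
  {
          struct sk_buff *skb = (struct sk_buff *)ctx->skbaddr;
          unsigned char *head;
          u16 net_off, trans_off;
          struct iphdr iph;
          struct tcphdr th;

          /* walk skb->head + header offsets; assumes IPv4 and that
           * the offsets were set before the drop */
          head = BPF_CORE_READ(skb, head);
          net_off = BPF_CORE_READ(skb, network_header);
          trans_off = BPF_CORE_READ(skb, transport_header);

          if (bpf_probe_read_kernel(&iph, sizeof(iph), head + net_off))
                  return 0;
          if (iph.protocol != IPPROTO_TCP)
                  return 0;
          if (bpf_probe_read_kernel(&th, sizeof(th), head + trans_off))
                  return 0;

          /* th.dest is the local (listening) port for inbound drops */
          bpf_printk("drop reason %d dport %d",
                     ctx->reason, bpf_ntohs(th.dest));
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";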

>
>>> The need to extract the protocol-level information in ebpf only
>>> makes kfree_skb more expensive for catching the rare cases when we
>>> run out of buffer space (UDP) or listen queue (TCP). These two
>>> cases are very common failure scenarios that people are interested
>>> in catching with straightforward tracepoints that can give them the
>>> needed information easily and cheaply.
>>>
>>> I sympathize with the desire to keep the number of tracepoints in
>>> check, but I also feel like UDP buffer drops and TCP listen drops
>>> tracepoints are very much justified to exist.
>>
>> sure, kfree_skb is like the raw_syscall tracepoint - it can be more than
>> what you need for a specific problem, but it also gives you way more
>> than you are thinking about today.
>
> I really like the comparison to raw_syscall tracepoint. There are two flavors:
>
> 1. Catch-all: raw_syscalls:sys_enter, which is similar to skb:kfree_skb.
> 2. Specific tracepoints: syscalls:sys_enter_* for every syscall.
>
> If you are interested in one rare syscall, you wouldn't attach to a
> catch-all firehose and then filter for the id in post. Understandably,
> we probably can't have a separate skb:kfree_skb for every reason.

yeah, I pushed for the 'reason' design rather than having a tracepoint
per drop location, as it is way more scalable. There is a balance, and
I believe that is what Kuba is getting at here.

> However, some of them are more useful than others and I believe that
> tcp listen drops fall into that category.
>
> We went through a similar exercise with the audit subsystem, which in
> fact always arms all syscalls even if you audit only one of them:
>
> * https://lore.kernel.org/audit/20230523181624.19932-1-ivan@xxxxxxxxxxxxxx/T/#u
>
> With pictures, if you're interested:
>
> * https://mastodon.ivan.computer/@mastodon/110426498281603668
>
> Nobody audits futex(), but if you audit execve(), all the rules run
> for both. Some rules will run faster, but all of them will run. It's a
> lot of overhead with millions of CPUs, which I'm trying to avoid (the
> planet is hot as it is).
>
> Ultimately my arguments for a separate tracepoint for tcp listen drops
> are at the bottom of my reply to Jakub:
>
> * https://lore.kernel.org/netdev/CABWYdi2BGi=iRCfLhmQCqO=1eaQ1WaCG7F9WsJrz-7==ocZidg@xxxxxxxxxxxxxx/