Re: [RFC PATCH net-next 1/2] net: Use SMP threads for backlog NAPI.

From: Paolo Abeni
Date: Fri Sep 22 2023 - 05:39:09 EST


On Wed, 2023-09-20 at 17:57 +0200, Sebastian Andrzej Siewior wrote:
> On 2023-08-23 15:35:41 [+0200], Paolo Abeni wrote:
> > On Mon, 2023-08-14 at 11:35 +0200, Sebastian Andrzej Siewior wrote:
> > > @@ -4781,7 +4733,7 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
> > > * We can use non atomic operation since we own the queue lock
> > > */
> > > if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state))
> > > - napi_schedule_rps(sd);
> > > + __napi_schedule_irqoff(&sd->backlog);
> > > goto enqueue;
> > > }
> > > reason = SKB_DROP_REASON_CPU_BACKLOG;
> >
> > I *think* that the above could be quite dangerous when cpu ==
> > smp_processor_id() - that is, with plain veth usage.
> >
> > Currently, each packet runs into the rx path just after
> > enqueue_to_backlog()/tx completes.
> >
> > With this patch there will be a burst effect: the backlog thread
> > will run only after several packets have been enqueued, whenever the
> > process scheduler decides to run it. Note that the current CPU is
> > already hosting a running process, the tx thread.
> >
> > The above can cause packet drops (due to limited buffering) or very
> > high latency (due to long bursts), even in non-overload situations,
> > and both are quite hard to debug.
> >
> > I think the above needs to be an opt-in, but I guess that even RT
> > deployments doing some packet forwarding will not be happy with this
> > on.
>
> I've been looking at this again and have been thinking about what you
> said here. I think part of the problem is that we lack a
> policy/mechanism for detecting when a DoS is happening and deciding
> what to do.
>
> Before commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its
> job"") when a lot of network packets are processed then processing is
> moved to ksoftirqd and continues based on how the scheduler schedules
> the SCHED_OTHER ksoftirqd task. This avoids lock-ups of the system and
> it can do something else in between. Any interrupt will not continue the
> outstanding softirq backlog but wait for ksoftirqd. So it basically
> avoids the networking overload. It throttles the throughput if needed.
>
> This isn't the case after that commit. Now, the CPU can be stuck with
> processing networking packets if the packets come in fast enough. Even
> if ksoftirqd is woken up, the next interrupt (say the timer) will
> continue with at least one round.
> By using NAPI-threads it is possible to give the control back to the
> scheduler which can throttle the NAPI processing in favour of other
> threads that ask for CPU. As you pointed out, waking the thread does not
> guarantee that it will immediately do the NAPI work. It can be delayed
> based on current load on the system.
>
> This could be influenced by assigning the NAPI-thread a SCHED_FIFO
> priority. Based on the priority it could be ensured that the thread
> starts right away or "later" if something else is more important.
> However, this opens the DoS window again: The scheduler will put the
> NAPI thread on CPU as long as it asks for it with no throttling.
>
> If we could somehow define a DoS condition once we are overwhelmed with
> packets, then we could act on it and throttle it. This in turn would
> allow a SCHED_FIFO priority without the fear of a lockup if the system
> is flooded with packets.

I declare ENOCOFFEE before starting, be warned!

I fear this is becoming a bit too theoretical, but we can infer a DoS
condition if the napi thread enqueues a packet somewhere (socket buffer,
qdisc, tx ring, ???) and the utilization of that queue is "high" (say
> 75% of max).
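
To make the heuristic above a bit more concrete, here is a minimal
userspace sketch of the check. It is not kernel code: the helper name,
the macros and the way the queue length and limit are passed in are all
made up for illustration, and the 3/4 ratio is just the "say > 75%"
guess from above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical threshold: "high" utilization means above 3/4 of max. */
#define OVERLOAD_NUM 3ULL
#define OVERLOAD_DEN 4ULL

/*
 * Hypothetical helper: return true when a queue currently holding
 * qlen entries, with room for at most qmax, looks overloaded.
 */
static bool queue_looks_overloaded(unsigned int qlen, unsigned int qmax)
{
	assert(qmax > 0);

	/*
	 * qlen / qmax > 3/4, rewritten without a division. The operands
	 * are widened to unsigned long long so the products cannot wrap.
	 */
	return (unsigned long long)qlen * OVERLOAD_DEN >
	       (unsigned long long)qmax * OVERLOAD_NUM;
}
```

With a limit of 100 entries, a queue at 76 would be flagged and one at
exactly 75 would not. What to do once the check fires is a separate
question.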

I have no idea how to throttle a FIFO thread while retaining its priority.

More importantly, this kind of configuration is not really viable for a
generic !PREEMPT_RT build, while my concern about a threaded napi
backlog (or serving the backlog from ksoftirqd) applies there.

Cheers,

Paolo