Re: [PATCH] wifi: iwlwifi: Fix spurious packet drops with RSS

From: Sultan Alsawaf
Date: Thu May 04 2023 - 13:55:43 EST


On Thu, May 04, 2023 at 02:10:50PM +0200, Johannes Berg wrote:
> [let's see if my reply will make it to the list, the original seems to
> not have]
>
> On Sun, 2023-04-30 at 00:13 +0000, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf <sultan@xxxxxxxxxxxxxxx>
> >
> > When RSS is used and one of the RX queues lags behind others by more than
> > 2048 frames, then new frames arriving on the lagged RX queue are
> > incorrectly treated as old rather than new by the reorder buffer, and are
> > thus spuriously dropped. This is because the reorder buffer treats frames
> > as old when they have an SN that is more than 2048 away from the head SN,
> > which causes the reorder buffer to drop frames that are actually valid.
> >
> > The odds of this occurring naturally increase with the number of
> > RX queues used, so CPUs with many threads are more susceptible to
> > encountering spurious packet drops caused by this issue.
> >
> > As it turns out, the firmware already detects when a frame is either old or
> > duplicated and exports this information, but it's currently unused. Using
> > these firmware bits to decide when frames are old or duplicated fixes the
> > spurious drops.
>
> So I assume you tested it now, and it works? Somehow I had been under
> the impression we never got it to work back when...

Yep, I've been using this for about a year and have let it run through the
original iperf3 reproducer I mentioned on bugzilla for hours with no stalls. My
big git clones don't freeze anymore either. :)

What I wasn't able to get working was the big reorder buffer cleanup that's made
possible by using these firmware bits. The explicit queue sync can be removed
easily, but there were further potential cleanups you had mentioned that I
wasn't able to get working.

I hadn't submitted this patch sooner because I was hoping to get the big
cleanup done at the same time, but I've been too busy to finish it. Since this
small patch does fix the issue, my thought is that it could be merged and sent
to stable, and with subsequent patches I can chip away at cleaning up the
reorder buffer.

> > Johannes mentions that the 9000 series' firmware doesn't support these
> > bits, so disable RSS on the 9000 series chipsets since they lack a
> > mechanism to properly detect old and duplicated frames.
>
> Indeed, I checked this again, I also somehow thought it was backported
> to some versions but doesn't look like. We can either leave those old
> ones broken (they only shipped with fewer cores anyway), or just disable
> it as you did here, not sure. RSS is probably not as relevant with those
> slower speeds anyway.

Agreed, I think it's worth disabling RSS on the 9000 series to fix it there. If
the RX queues are heavily backed up and incoming packets are not released fast
enough due to a slow CPU, then I think the spurious drops could happen somewhat
regularly on slow devices using the 9000 series.

It's probably also difficult to judge the impact/frequency of these spurious
drops in the wild due to TCP retries potentially masking them. The issue can be
very noticeable when a lot of packets are spuriously dropped at once though, so
I think it's certainly worth the tradeoff to disable RSS on the older chipsets.
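
For anyone following along, here is a rough standalone sketch of why a backed-up
queue trips the old-frame check. It is not the driver code: sn_less() is only
modelled on the half-window comparison that ieee80211_sn_less() performs on
12-bit sequence numbers, and drop_by_window(), drop_by_fw_flag() and
EXAMPLE_FW_OLD_SN_FLAG (including its bit position) are hypothetical names made
up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define SN_MODULO 4096u            /* 802.11 sequence numbers are 12 bits wide */
#define SN_MASK   (SN_MODULO - 1u)

/* Hypothetical stand-in for a firmware "old SN" bit; position is illustrative. */
#define EXAMPLE_FW_OLD_SN_FLAG (1u << 31)

/* True when sn1 is considered "before" sn2 under the half-window rule. */
static bool sn_less(uint16_t sn1, uint16_t sn2)
{
	return (((uint32_t)sn1 - (uint32_t)sn2) & SN_MASK) > (SN_MODULO >> 1);
}

/* Host-side decision: relies only on this queue's (possibly stale) head SN. */
static bool drop_by_window(uint16_t sn, uint16_t head_sn)
{
	return sn_less(sn, head_sn);
}

/* Firmware-side decision: trust the flag exported in the reorder data word. */
static bool drop_by_fw_flag(uint32_t reorder_data)
{
	return (reorder_data & EXAMPLE_FW_OLD_SN_FLAG) != 0;
}
```

With head_sn = 100, a frame with SN 2100 passes the window check, but a frame
with SN 2200 is more than 2048 ahead and drop_by_window() reports it as old even
though it is new. The flag-based check has no such ambiguity because it doesn't
depend on how far this queue's head SN has fallen behind.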

> > +++ b/drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c
> > @@ -918,7 +918,6 @@ static bool iwl_mvm_reorder(struct iwl_mvm *mvm,
> >  	struct iwl_mvm_sta *mvm_sta;
> >  	struct iwl_mvm_baid_data *baid_data;
> >  	struct iwl_mvm_reorder_buffer *buffer;
> > -	struct sk_buff *tail;
> >  	u32 reorder = le32_to_cpu(desc->reorder_data);
> >  	bool amsdu = desc->mac_flags2 & IWL_RX_MPDU_MFLG2_AMSDU;
> >  	bool last_subframe =
> > @@ -1020,7 +1019,7 @@ static bool iwl_mvm_reorder(struct iwl_mvm *mvm,
> >  				 rx_status->device_timestamp, queue);
> >
> >  	/* drop any oudated packets */
> > -	if (ieee80211_sn_less(sn, buffer->head_sn))
> > +	if (reorder & IWL_RX_MPDU_REORDER_BA_OLD_SN)
> >  		goto drop;
> >
> >  	/* release immediately if allowed by nssn and no stored frames */
> > @@ -1068,24 +1067,12 @@ static bool iwl_mvm_reorder(struct iwl_mvm *mvm,
> >  		return false;
> >  	}
>
> All that "send queue sync" code in the middle that was _meant_ to fix
> this issue but I guess never really did can also be removed, no? And the
> timer, etc. etc.

Indeed, and removing the queue sync + timer are easy. Would you prefer I send
additional patches for at least those cleanups before the fix itself can be
considered for merging?

Sultan