Re: Divide error in minstrel_ht_get_tp_avg()

From: Oleksandr Natalenko
Date: Sat Aug 21 2021 - 07:14:48 EST


Hello.

On pondělí 7. června 2021 16:52:30 CEST Oleksandr Natalenko wrote:
> On Sat, May 29, 2021 at 06:57:29PM +0200, Oleksandr Natalenko wrote:
> > The following woe happened on my home router running as a Wi-Fi AP with
> > MT7612:
> >
> > ```
> > [16592.157962] divide error: 0000 [#1] PREEMPT SMP PTI
> > [16592.163169] CPU: 2 PID: 683 Comm: mt76-tx phy0 Tainted: G C
> > 5.12.0-pf4 #1 [16592.171745] Hardware name: To Be Filled By O.E.M. To
> > Be Filled By O.E.M./J3710-ITX, BIOS P1.50 04/16/2018 [16592.182155] RIP:
> > 0010:minstrel_ht_get_tp_avg+0xb1/0x100 [mac80211] [16592.188795] Code: 04
> > 00 00 00 7f 1c 31 c9 81 fa f1 49 02 00 0f 9c c1 8d 0c cd 08 00 00 00 eb
> > 08 8b 47 30 b9 01 00 00 00 69 c0 e8 03 00 00 31 d2 <f7> f1 ba 66 0e 00 00
> > 39 d6 0f 4f f2 49 63 d1 48 8d 0c 52 48 8d 0c [16592.208796] RSP:
> > 0018:ffffb6a601293be8 EFLAGS: 00010246
> > [16592.214292] RAX: 000000000001a5e0 RBX: ffff9c658ee95170 RCX:
> > 0000000000000000 [16592.222054] RDX: 0000000000000000 RSI:
> > 0000000000000585 RDI: ffff9c658ee94000 [16592.229620] RBP:
> > 0000000000000006 R08: 0000000000000006 R09: 0000000000000012
> > [16592.237254] R10: 0000000000000007 R11: 0000000000000000 R12:
> > ffff9c658ee94000 [16592.244828] R13: 0000000000000012 R14:
> > ffff9c658ee9534c R15: 0000000000000585 [16592.252635] FS:
> > 0000000000000000(0000) GS:ffff9c66f8500000(0000) knlGS:0000000000000000
> > [16592.261086] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [16592.267326] CR2: 0000555d81abe7b8 CR3: 0000000116c10000 CR4:
> > 00000000001006e0 [16592.274904] Call Trace:
> > [16592.277549] minstrel_ht_update_stats+0x1fe/0x1320 [mac80211]
> > [16592.283730] minstrel_ht_tx_status+0x67f/0x710 [mac80211]
> > [16592.289442] rate_control_tx_status+0x6e/0xb0 [mac80211]
> > [16592.295267] ieee80211_tx_status_ext+0x22e/0xb00 [mac80211]
> > [16592.311290] ieee80211_tx_status+0x7d/0xa0 [mac80211]
> > [16592.316714] mt76_tx_status_unlock+0x83/0xa0 [mt76]
> > [16592.321999] mt76x02_send_tx_status+0x1b7/0x400 [mt76x02_lib]
> > [16592.334516] mt76x02_tx_worker+0x8f/0xd0 [mt76x02_lib]
> > [16592.340091] __mt76_worker_fn+0x78/0xb0 [mt76]
> > [16592.345067] kthread+0x183/0x1b0
> > [16592.353378] ret_from_fork+0x22/0x30
> > ```
> >
> > `faddr2line` says it is this:
> >
> > ```
> >
> > 430 int
> > 431 minstrel_ht_get_tp_avg(struct minstrel_ht_sta *mi, int group, int
> > rate,
> > 432 int prob_avg)
> > 433 {
> >
> > …
> >
> > 435 unsigned int ampdu_len = 1;
> >
> > …
> >
> > 441 if (minstrel_ht_is_legacy_group(group))
> >
> > …
> >
> > 443 else
> > 444 ampdu_len = minstrel_ht_avg_ampdu_len(mi);
> >
> > …
> >
> > 446 nsecs = 1000 * overhead / ampdu_len;
> >
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > ```
> >
> > So, it seems `minstrel_ht_avg_ampdu_len()` can return 0, which is not
> > really legitimate.
> >
> > Looking at `minstrel_ht_avg_ampdu_len()`, I see the following:
> >
> > ```
> > 16:#define MINSTREL_SCALE 12
> > …
> > 18:#define MINSTREL_TRUNC(val) ((val) >> MINSTREL_SCALE)
> > ```
> >
> > ```
> >
> > 401 static unsigned int
> > 402 minstrel_ht_avg_ampdu_len(struct minstrel_ht_sta *mi)
> > 403 {
> >
> > …
> >
> > 406 if (mi->avg_ampdu_len)
> > 407 return MINSTREL_TRUNC(mi->avg_ampdu_len);
> >
> > ```
> >
> > So, likely, `mi->avg_ampdu_len` is non-zero, but it's too small, hence
> > right bitshift makes it zero.
> >
> > Should there be some protection against such a situation?
> >
> > Felix, could you maybe check this? Looks like you were the last one to
> > overhaul that part of code.
>
> Quick hack from me:
>
> commit a23d3f77658de968c5bb852ba478402c93958f7f
> Author: Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx>
> Date: Mon Jun 7 16:45:45 2021 +0200
>
> mac80211: minstrel_ht: force ampdu_len to be > 0
>
> This is a hack.
>
> Work around the following crash:
>
> ```
> divide error: 0000 [#1] PREEMPT SMP PTI
> CPU: 2 PID: 683 Comm: mt76-tx phy0 Tainted: G C
> 5.12.0-pf4 #1 Hardware name: To Be Filled By O.E.M. To Be Filled By
> O.E.M./J3710-ITX, BIOS P1.50 04/16/2018 RIP:
> 0010:minstrel_ht_get_tp_avg+0xb1/0x100 [mac80211]
>
> Call Trace:
> minstrel_ht_update_stats+0x1fe/0x1320 [mac80211]
> minstrel_ht_tx_status+0x67f/0x710 [mac80211]
> rate_control_tx_status+0x6e/0xb0 [mac80211]
> ieee80211_tx_status_ext+0x22e/0xb00 [mac80211]
> ieee80211_tx_status+0x7d/0xa0 [mac80211]
> mt76_tx_status_unlock+0x83/0xa0 [mt76]
> mt76x02_send_tx_status+0x1b7/0x400 [mt76x02_lib]
> mt76x02_tx_worker+0x8f/0xd0 [mt76x02_lib]
> __mt76_worker_fn+0x78/0xb0 [mt76]
> kthread+0x183/0x1b0
> ret_from_fork+0x22/0x30
> ```
>
> Link:
> https://lore.kernel.org/lkml/20210529165728.bskaozwtmwxnvucx@spock.localdom
> ain/ Signed-off-by: Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx>
> ---
> diff --git a/net/mac80211/rc80211_minstrel_ht.c
> b/net/mac80211/rc80211_minstrel_ht.c index ecad9b10984f..6ad188c4101e
> 100644
> --- a/net/mac80211/rc80211_minstrel_ht.c
> +++ b/net/mac80211/rc80211_minstrel_ht.c
> @@ -440,8 +440,13 @@ minstrel_ht_get_tp_avg(struct minstrel_ht_sta *mi, int
> group, int rate,
>
> if (minstrel_ht_is_legacy_group(group))
> overhead = mi->overhead_legacy;
> - else
> + else {
> ampdu_len = minstrel_ht_avg_ampdu_len(mi);
> + if (unlikely(!ampdu_len)) {
> + pr_err_once("minstrel_ht_get_tp_avg: ampdu_len == 0!");
> + ampdu_len = 1;
> + }
> + }
>
> nsecs = 1000 * overhead / ampdu_len;
> nsecs += minstrel_mcs_groups[group].duration[rate] <<

I've also found out that this happens exactly at midnight, IOW, at 00:00:00.
Not every midnight, though.

Does it have something to do with timekeeping? This is strange, I wouldn't
expect kernel to act like that. Probably, some client sends malformed frame?
How to find out?

Thanks.

P.S. The hack works, BTW, the system survives, and AP runs as if nothing
happened.

--
Oleksandr Natalenko (post-factum)