Re: [BUG] brcmfmac: brcmf_sdio_bus_rxctl: resumed on timeout (WiFi dies)

From: Dmitry Osipenko
Date: Fri Jun 18 2021 - 16:00:41 EST


28.05.2021 01:47, Dmitry Osipenko wrote:
> 27.05.2021 19:42, Arend van Spriel wrote:
>> On 5/26/2021 5:10 PM, Dmitry Osipenko wrote:
>>> Hello,
>>>
>>> After updating to Ubuntu 21.04 I found two problems related to the
>>> BRCMF_C_GET_ASSOCLIST using an older BCM4329 SDIO WiFi.
>>>
>>> 1. The kernel is spammed with:
>>>
>>>   ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST
>>> unsupported, err=-52
>>>   ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST
>>> unsupported, err=-52
>>>   ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST
>>> unsupported, err=-52
>>>
>>> Which happens apparently due to a newer NetworkManager version that
>>> pokes dump_station() periodically. I sent [1] that fixes this noise.
>>>
>>> [1]
>>> https://patchwork.kernel.org/project/linux-wireless/list/?series=480715
>>
>> Right. I noticed this one and did not have anything to add to the
>> review/suggestion.
>
> Please feel free to add your r-b to the patches if they look good to you.
>
>>> 2. The other much worse problem is that WiFi eventually dies now with
>>> these errors:
>>>
>>> ...
>>>   ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST
>>> unsupported, err=-52
>>>   brcmfmac: brcmf_sdio_bus_rxctl: resumed on timeout
>>>   ieee80211 phy0: brcmf_cfg80211_dump_station: BRCMF_C_GET_ASSOCLIST
>>> unsupported, err=-110
>>>   ieee80211 phy0: brcmf_proto_bcdc_query_dcmd: brcmf_proto_bcdc_msg
>>> failed w/status -110
>>>
>>> From this point all firmware calls start to fail with err=-110 and
>>> WiFi doesn't work anymore. This problem is reproducible with 5.13-rc
>>> and current -next, I haven't checked older kernel versions. Somehow
>>> it's worse using a recent -next, WiFi dies quicker.
>>>
>>> What's interesting is that I see that there is always a pending signal
>>> in brcmf_sdio_dcmd_resp_wait() when timeout happens. It looks like the
>>> timeout happens when there is access to a swap partition, which stalls
>>> the system for a second or two, but this is not 100%. Increasing
>>> DCMD_RESP_TIMEOUT doesn't help.
>>
>> The timeout error (-110) can have two root causes that I am aware of.
>> Either the firmware died or the SDIO layer has gone haywire. Not sure if
>> that swap partition is on an eMMC device, but if so it could be related.
>> You could try generating a device coredump. If that also gives -110 errors
>> we know it is the SDIO layer.
>
> A coredump is a good idea, thank you. The swap partition is on an
> external SD card, everything else is on eMMC.
>
>>> Please let me know if you have any ideas of how to fix this trouble
>>> properly or if you need any more info.
>>>
>>> Removing BRCMF_C_GET_ASSOCLIST firmware call entirely from the driver
>>> fixes the problem.
>>
>> My guess is that reducing interaction with firmware is what is avoiding
>> the issue and not so much this specific firmware command. As always it
>> is good to know the conditions in which the issue occurs. What is the
>> hardware platform you are running Ubuntu on? Stuff like that.
>
> That's an older Acer A500 NVIDIA Tegra20 tablet device [1]. I may also
> try to reproduce the problem on a Tegra30 Nexus 7 with BCM4330.
>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/tegra20-acer-a500-picasso.dts
>
> Thank you very much for the suggestions. I will try to collect more info
> and come back with the report.
>

I have been testing this for the past weeks and the problem is no
longer reproducible. Apparently something got fixed in linux-next. I
haven't tried to bisect the fix since it's a bit too painful to do.

There are still occasional -110 errors when the system stalls on a
memory swap, but WiFi keeps working now.