Re: Banana Pi-R1 stabil

From: Maxime Ripard
Date: Tue Mar 05 2019 - 04:28:35 EST


On Sat, Mar 02, 2019 at 09:42:08AM +0100, Gerhard Wiesinger wrote:
> On 01.03.2019 10:30, Maxime Ripard wrote:
> > On Thu, Feb 28, 2019 at 08:41:53PM +0100, Gerhard Wiesinger wrote:
> > > On 28.02.2019 10:35, Maxime Ripard wrote:
> > > > On Wed, Feb 27, 2019 at 07:58:14PM +0100, Gerhard Wiesinger wrote:
> > > > > On 27.02.2019 10:20, Maxime Ripard wrote:
> > > > > > On Sun, Feb 24, 2019 at 09:04:57AM +0100, Gerhard Wiesinger wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > > I have 3 Banana Pi R1 boards: one running a self-compiled kernel
> > > > > > > > 4.7.4-200.BPiR1.fc24.armv7hl on old Fedora 25, which is VERY STABLE; the 2
> > > > > > > > others are running the latest Fedora 29, kernel 4.20.10-200.fc29.armv7hl. I
> > > > > > > > tried a lot of kernels from around 4.11
> > > > > > > > (kernel-4.11.10-200.fc25.armv7hl) up to 4.20.10, but all of them either
> > > > > > > > crashed without any output on the serial console or hit kernel panics after
> > > > > > > > a short period of time (minutes, hours, at most days).
> > > > > > >
> > > > > > > Latest known working and stable self compiled kernel: kernel
> > > > > > > 4.7.4-200.BPiR1.fc24.armv7hl:
> > > > > > >
> > > > > > > https://www.wiesinger.com/opensource/fedora/kernel/BananaPi-R1/
> > > > > > >
> > > > > > > > With 4.8.x the DSA b53 switch infrastructure was introduced, which
> > > > > > > > didn't work (until ca8931948344c485569b04821d1f6bcebccd376b and kernel
> > > > > > > > 4.18.x):
> > > > > > >
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/net/dsa/b53?h=v4.20.12
> > > > > > >
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/drivers/net/dsa/b53?h=v4.20.12
> > > > > > >
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/dsa/b53?h=v4.20.12&id=ca8931948344c485569b04821d1f6bcebccd376b
> > > > > > >
> > > > > > > > It has been fixed with kernel 4.18.x:
> > > > > > >
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/drivers/net/dsa/b53?h=linux-4.18.y
> > > > > > >
> > > > > > >
> > > > > > > > So the current status is that the kernel crashes regularly, see some samples
> > > > > > > > below. It is typically an "Unable to handle kernel paging request at virtual addres"
> > > > > > >
> > > > > > > Another interesting thing: A Banana Pro works well (which has also an
> > > > > > > Allwinner A20 in the same revision) running same Fedora 29 and latest
> > > > > > > kernels (e.g. kernel 4.20.10-200.fc29.armv7hl.).
> > > > > > >
> > > > > > > > Since it happens on 2 different devices and with different power supplies
> > > > > > > > (all with enough power, and of the same type that works well with the
> > > > > > > > working old kernel), a hardware issue is very unlikely.
> > > > > > >
> > > > > > > I guess it has something to do with virtual memory.
> > > > > > >
> > > > > > > Any ideas?
> > > > > > > [47322.960193] Unable to handle kernel paging request at virtual addres 5675d0
> > > > > > That line is a bit suspicious
> > > > > >
> > > > > > > Anyway, cpufreq is known to cause this kind of error when the
> > > > > > > voltage / frequency association is not correct.
> > > > > >
> > > > > > Given the stack trace and that the BananaPro doesn't have cpufreq
> > > > > > enabled, my first guess would be that it's what's happening. Could you
> > > > > > try using the performance governor and see if it's more stable?
> > > > > >
> > > > > > If it is, then using this:
> > > > > > https://github.com/ssvb/cpuburn-arm/blob/master/cpufreq-ljt-stress-test
> > > > > >
> > > > > > will help you find the offending voltage-frequency couple.
> > > > > > To me it looks like they all have the same config regarding the CPU governor
> > > > > > (Banana Pro, old kernel stable one, new kernel unstable ones)
> > > > The Banana Pro doesn't have a regulator set up, so it will only change
> > > > the frequency, not the voltage.
> > > >
> > > > > They all have the ondemand governor set:
> > > > >
> > > > > I set on the 2 unstable "new kernel Banana Pi R1":
> > > > >
> > > > > # Set to max performance
> > > > > echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
> > > > > echo "performance" > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
> > > > What are the results?
> > > > Stable for more than 1.5 days now. Normally they would have crashed well
> > > > within that uptime. So it looks like the performance governor fixes it.
> > >
> > > > I guess the crashes occur because of CPU voltage and clock changes that
> > > > lead to invalid data (e.g. invalid RAM contents might be read, register
> > > > problems, etc).
> > >
> > > Any ideas how to fix it for ondemand mode, too?
> > Run https://github.com/ssvb/cpuburn-arm/blob/master/cpufreq-ljt-stress-test
> >
> > > > But that doesn't explain why it works with kernel 4.7.4 without any
> > > > problems.
> > > My best guess would be that cpufreq wasn't enabled at that time, or
> > > was enabled without voltage scaling.
> >
>
> Where can I see the voltage scaling parameters?
>
> In the DTS I don't see any difference between kernel 4.7.4 and 4.20.10
> regarding voltage:
>
> dtc -I dtb -O dts \
>   -o /boot/dtb-4.20.10-200.fc29.armv7hl/sun7i-a20-lamobo-r1.dts \
>   /boot/dtb-4.20.10-200.fc29.armv7hl/sun7i-a20-lamobo-r1.dtb

This can also be due to configuration changes, driver support, etc.
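
One way to check the configuration side is to compare what cpufreq
exposes on the old and the new kernel. A minimal sketch, assuming the
standard cpufreq sysfs layout under /sys/devices/system/cpu/cpu0/cpufreq;
the function name cpufreq_summary is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing tool): print the cpufreq
# driver, the active governor and the frequency table found in one
# cpufreq directory. A missing file is reported instead of being treated
# as an error, since cpufreq may not be enabled at all on a given kernel.
cpufreq_summary() {
    dir="${1:-/sys/devices/system/cpu/cpu0/cpufreq}"
    for f in scaling_driver scaling_governor scaling_available_frequencies; do
        if [ -r "$dir/$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$dir/$f")"
        else
            printf '%s: (not available)\n' "$f"
        fi
    done
}
```

Calling cpufreq_summary without arguments on 4.7.4 and on 4.20.10 and
comparing the two outputs should show whether cpufreq was active on the
old kernel at all.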

> There is another strange thing (tested with
> kernel-5.0.0-0.rc8.git1.1.fc31.armv7hl, kernel-4.19.8-300.fc29.armv7hl,
> kernel-4.20.13-200.fc29.armv7hl, kernel-4.20.10-200.fc29.armv7hl):
>
> There is ALWAYS high CPU usage of around 10% in a kworker:
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> 18722 root      20   0       0      0      0 I   9.5   0.0   0:47.52 [kworker/1:3-events_freezable_power_]
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>   776 root      20   0       0      0      0 I   8.6   0.0   0:02.77 [kworker/0:4-events]

The first one looks like it's part of the workqueue code.

> Therefore the CPU doesn't switch to low frequencies (see below).

You said previously that those crashes were happening when the board
was changing frequency, so I'm confused?

> Any ideas?

Run the cpufreq-ljt-stress-test script I have already pointed you to twice.

> BTW: Still stable at about 2.5 days on both devices. So the solution IS the
> performance governor.

No, the performance governor prevents any change in frequency. My
guess is that a lower frequency operating point is not working and is
crashing the CPU.
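
To narrow that down by hand, one could pin the CPU to each operating
point in turn and see where it falls over. A rough sketch, assuming the
kernel has the userspace governor available
(CONFIG_CPU_FREQ_GOV_USERSPACE) and the usual scaling_setspeed and
scaling_available_frequencies files; pin_freq and walk_freqs are made-up
names, and the workload is whatever command you pass in:

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Pin one cpufreq directory to a fixed frequency (in kHz) by switching
# to the userspace governor and writing the target to scaling_setspeed.
pin_freq() {
    dir="$1"
    khz="$2"
    echo userspace > "$dir/scaling_governor" &&
    echo "$khz" > "$dir/scaling_setspeed"
}

# Walk every advertised frequency and run the given workload at each one.
# The frequency is printed before the workload starts, so if the board
# hangs, the last line on the serial console names the suspect.
walk_freqs() {
    dir="$1"
    shift
    for khz in $(cat "$dir/scaling_available_frequencies"); do
        pin_freq "$dir" "$khz" || return 1
        echo "testing $khz kHz"
        "$@" || { echo "workload failed at $khz kHz"; return 1; }
    done
}
```

For example, as root, with a CPU-heavy command of your choice in place
of the placeholder:

    walk_freqs /sys/devices/system/cpu/cpu0/cpufreq <workload command>

The stress test script linked earlier in the thread remains the more
thorough way to do this.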

Maxime

--
Maxime Ripard, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
