Re: [PATCH v11 01/11] phy: HiSilicon: Add driver for Kirin 970 PCIe PHY

From: Mauro Carvalho Chehab
Date: Wed Aug 18 2021 - 05:01:35 EST


Hi Vinod,

Em Tue, 17 Aug 2021 16:12:37 +0530
Vinod Koul <vkoul@xxxxxxxxxx> escreveu:

> > + /* FIXME: calling it causes an Asynchronous SError interrupt */
> > +// kirin_pcie_clk_ctrl(phy, false);
>
> when will you fix the fixme and pls remove the deadcode

Working with clocks on this SoC is very tricky: there are lots of clock
lines (~70) that are critical for this device to work. Such lines are
enabled via the Device's firmware, and are supposed to be always
powered. Powering off such clock lines cause a SError.

Most clocks on this device are managed by the clk-hikey3670 driver.
At the current state of clk-hi3670, the only way for HiKey 970
to even boot is to add:

clk_ignore_unused=true

as a Kernel boot parameter. That is the solution given by the downstream
official distributions for HiKey970 at 96boards.

The fix is to flag the critical clocks with CLK_IS_CRITICAL at the
clk-hi3670 driver, but finding the right clock set has been a challenge.

I spent the last couple of weeks trying to identify the critical ones,
as I'm aiming to be able to use a Kernel built with a default arm64
one of my goals is to have this device working fine with a
"make defconfig" Kernel.

So, I added this patch:

https://lore.kernel.org/lkml/2d2de5e902ced072bcfd5e5311d6b10326b9245b.1627041240.git.mchehab+huawei@xxxxxxxxxx/

to my tree (which reduces the set of clocks using CLK_IGNORE_UNUSED
from 308 to 163 clocks). Than I ran script that was dropping the
flag one by one, boots the new Kernel and do a sanity check. When it
fails to boot, I manually dropped the patch, and re-run the script
to test the remaining clocks. After a couple of weeks, I reached a patch
with 78 clock lines that seemed critical, but the resulting patch was
not stable, as, depending on the day I boot the Kernel with such patch,
it crashes with SError in a couple of seconds after booting, or
cause the Ethernet firmware to not load.

I intend to keep trying to find the clock lines that can't be disabled,
but this is very time consuming, as I couldn't find any documentation
about that. So, it has to be done empirically.

-

In any case, fixing it doesn't sound a critical issue for the PHY
driver. I mean, right now, this patchset allows removing and
re-inseting the PCIe driver, which is already an improvement over the
original upstream driver, which was missing the power-off logic for
Kirin 960.

With this patchset, both power-off/power-on logic for both HiKey960
(where the PHY is inside the pcie-kirin driver) and for HiKey970,
which uses this PHY driver. On both devices, I tested an endless loop
with rmmod/modprobe for the PCIe.

Besides that, in practice, removing PCIe in runtime is something that
people usually don't do.

So, while it would be cool to balance the clock disable logic,
I don't think this is a critical issue in this particular case.

Thanks,
Mauro