Re: [PATCH v3] clk: HiSilicon: clk-hi3670: mark critical clocks as such

From: Mauro Carvalho Chehab
Date: Fri Aug 20 2021 - 14:40:31 EST


Em Fri, 20 Aug 2021 23:22:33 +0530
Manivannan Sadhasivam <mani@xxxxxxxxxx> escreveu:

> On Fri, Aug 20, 2021 at 05:35:43PM +0200, Mauro Carvalho Chehab wrote:
> > The fixed-rate and fixed_factor clocks are critical to the
> > system: disabling them causes HiKey970 to not finish booting.
> >
>
> You mean all fixed rate and fixed factor clocks are critical to the system?
> I don't think so.

I tested one by one a couple of times. It took me ~2 weeks or so to produce
this patch. I applied first a patch adding CLK_IGNORE_UNUSED to
all clocks. Then, I broke the big change into hundreds of individual
patches with:

TOP=$(git log -1 --pretty=%h)

mkdir -p patches/
rm patches/*

COUNT=$(git grep CLK_IGNORE_UNUSED drivers/clk/hisilicon/clk-hi3670.c|wc -l)
for i in $(seq 1 $COUNT); do
sed '0,/ | CLK_IGNORE_UNUSED/s///' -i drivers/clk/hisilicon/clk-hi3670.c; git commit -a -m "change $i" --no-edit
done

git format-patch ${TOP}.. -o patches/
(cd patches/; ls *.patch >series)

git reset --hard $TOP
git commit -a -m TEST --no-edit --allow-empty

Then, I ran this script in order to test each change individually[1]:

#!/bin/bash
trap ctrl_c INT

function ctrl_c() {
echo "** Aborting script"
exit 1
}

LANG=C

LAST=$(cat patches/series|sort -n|tail -1)

while :; do
TOP=$(quilt top)
echo $TOP

git commit --amend -a --allow-empty --no-edit
./test970.sh
./test_host || break
ssh hikey970 ./check_kernel
if [ "$?" != "0" ]; then
TOP=`quilt top`
quilt pop
toquilt $TOP
fi
if [ "$TOP" == "patches/$LAST" ]; then
echo "Finishing."
exit 0
fi
quilt push
done

Every time the script fails, I had to power off it by hand, power it
on and then boot it with a working Kernel, moving the offending patch
to the end of quilt, with:

TOP=`quilt top`; quilt pop; toquilt $TOP

[1] This script uses a couple of ancillary scripts:

a) test970.sh:
pushes the change from my git tree into the HiKey970,
builds a new kernel and boots it;
b) test_host:
check if the board's network came up after booting;
c) ./check_kernel:
checks if the boot succeeded and the board is at a sane state;
d) toquilt:
moves a patch to the end of the quilt tree.

Once I finished. I re-checked the patches, in order to double check
that the unapplied patches are all causing issues.

At the end, I renamed the flags with:

$ sed s,CLK_IGNORE_UNUSED,CLK_IS_CRITICAL, -i drivers/clk/hisilicon/clk-hi3670.c

So, yeah, I'm pretty sure that fixed rate and fixed factor clocks
are all critical to the system.

> You should identify the gateable clocks that are causing the issues and see if
> they can be marked as critical.

They were already checked by the above procedure.

>
> Also, at what point of time you are seeing the boot issue?

That depends on the clock. See, this device is taking ~11 seconds to finish
booting and starting a prompt at the console here.

Most of critical clocks cause a panic() with SError up to ~2 seconds after
start booting:

[ 2.173211] SError Interrupt on CPU3, code 0xbf000002 -- SError
[ 2.173216] CPU: 3 PID: 102 Comm: kworker/3:1H Not tainted 5.14.0-rc1+ #723
[ 2.173218] Hardware name: HiKey970 (DT)
[ 2.173220] Workqueue: kblockd blk_mq_run_work_fn
[ 2.173226] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO BTYPE=--)
[ 2.173228] pc : el1_interrupt+0x20/0x80
[ 2.173229] lr : el1h_64_irq_handler+0x18/0x24
[ 2.173231] sp : ffff80001283b920
[ 2.173233] x29: ffff80001283b920 x28: ffff000100b6c740 x27: ffff0001011e4040
[ 2.173239] x26: 0000000000000001 x25: ffff0001011e4dc0 x24: 000000000000001b
[ 2.173244] x23: 0000000080000005 x22: ffff800010ed5078 x21: ffff80001283baa0
[ 2.173249] x20: ffff8000104e42e0 x19: ffff80001283b950 x18: ffff8000121db8c0
[ 2.173255] x17: 0000000000001800 x16: 0000000000000000 x15: 0000000000000000
[ 2.173260] x14: 00000000000002a7 x13: ffff000101d56298 x12: 0000000000000001
[ 2.173266] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
[ 2.173271] x8 : 0000000000000000 x7 : 0001000000a30100 x6 : 00000000016ec5e1
[ 2.173276] x5 : 00ffffffffffffff x4 : 0000000000000000 x3 : ffff0001011e2030
[ 2.173281] x2 : 0000000000000000 x1 : 00000000000000c0 x0 : ffff80001283b950
[ 2.173288] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 2.173290] CPU: 3 PID: 102 Comm: kworker/3:1H Not tainted 5.14.0-rc1+ #723
[ 2.173292] Hardware name: HiKey970 (DT)
[ 2.173294] Workqueue: kblockd blk_mq_run_work_fn
[ 2.173297] Call trace:
[ 2.173298] dump_backtrace+0x0/0x1e0
[ 2.173300] show_stack+0x18/0x24
[ 2.173302] dump_stack_lvl+0x68/0x84
[ 2.173304] dump_stack+0x18/0x34
[ 2.173305] panic+0x16c/0x334
[ 2.173307] nmi_panic+0x8c/0x90
[ 2.173309] arm64_serror_panic+0x78/0x84
[ 2.173310] do_serror+0x58/0x5c
[ 2.173312] el1h_64_error_handler+0x30/0x50
[ 2.173314] el1h_64_error+0x78/0x7c
[ 2.173316] el1_interrupt+0x20/0x80
[ 2.173318] el1h_64_irq_handler+0x18/0x24
[ 2.173319] el1h_64_irq+0x78/0x7c
[ 2.173321] _raw_spin_unlock_irqrestore+0x18/0x50
[ 2.173323] ufshcd_queuecommand+0x34c/0x480
[ 2.173324] scsi_queue_rq+0x420/0x96c
[ 2.173327] blk_mq_dispatch_rq_list+0x118/0x800
[ 2.173328] __blk_mq_do_dispatch_sched+0x238/0x2b4
[ 2.173330] __blk_mq_sched_dispatch_requests+0x110/0x170
[ 2.173332] blk_mq_sched_dispatch_requests+0x3c/0x80
[ 2.173334] __blk_mq_run_hw_queue+0x54/0x90
[ 2.173336] blk_mq_run_work_fn+0x24/0x30
[ 2.173338] process_one_work+0x1cc/0x350
[ 2.173340] worker_thread+0x138/0x46c
[ 2.173341] kthread+0x150/0x160
[ 2.173343] ret_from_fork+0x10/0x18
[ 2.173365] SMP: stopping secondary CPUs
[ 2.173367] Kernel Offset: disabled
[ 2.173369] CPU features: 0x00003051,00000846
[ 2.173371] Memory Limit: none

a couple of clocks that cause the device to stop producing any dmesg
after ~2 seconds:

...
[ 1.886460] printk: bootconsole [pl11] disabled
[ 1.886460] printk: bootconsole [pl11] disabled
[ 1.924965] i2c 0-004e: Fixing up cyclic dependency with hikey_usbhub
[ 1.925433] sdc: sdc1 sdc2 sdc3 sdc4 sdc5 sdc6 sdc7 sdc8 sdc9 sdc10 sdc11 sdc12
[ 1.935446] dwmmc_k3 fc183000.dwmmc2: fifo-depth property not found, using value of FIFOTH register as default
[ 1.935977] ALSA device list:
[ 1.935977] No soundcards found.
[ 1.935977] sd 0:0:0:3: [sdc] Attached SCSI disk
[ 1.935977] uart-pl011 fff32000.serial: no DMA platform data
<no more dmesg>

There are some clocks that cause SError panic() after ~4 seconds or so,
when some driver would be probed, like this one:

[ 4.012925] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 4.012926] CPU: 5 PID: 142 Comm: kworker/u16:3 Not tainted 5.14.0-rc1+ #718
[ 4.012927] Hardware name: HiKey970 (DT)
[ 4.012928] Workqueue: events_unbound deferred_probe_work_func
[ 4.012930] Call trace:
[ 4.012930] dump_backtrace+0x0/0x1e0
[ 4.012931] show_stack+0x18/0x24
[ 4.012932] dump_stack_lvl+0x68/0x84
[ 4.012932] dump_stack+0x18/0x34
[ 4.012933] panic+0x16c/0x334
[ 4.012934] nmi_panic+0x8c/0x90
[ 4.012934] arm64_serror_panic+0x78/0x84
[ 4.012935] do_serror+0x58/0x5c
[ 4.012936] el1h_64_error_handler+0x30/0x50
[ 4.012937] el1h_64_error+0x78/0x7c
[ 4.012938] pin_request+0x8c/0x2b0
[ 4.012939] pinmux_request_gpio+0x50/0x9c
[ 4.012939] pinctrl_gpio_request+0x64/0x1bc
[ 4.012940] gpiochip_generic_request+0x30/0x4c
[ 4.012941] gpiod_request_commit+0xf8/0x1d0
[ 4.012942] gpiod_request.part.0+0xd0/0x10c
[ 4.012942] gpiod_request+0x44/0xd0
[ 4.012943] gpio_request+0x30/0x50
[ 4.012944] devm_gpio_request+0x54/0xb0
[ 4.012945] kirin_pcie_probe+0x738/0x88c [pcie_kirin]
[ 4.012945] platform_probe+0x68/0xe0
[ 4.012946] really_probe+0x1b0/0x42c
[ 4.012947] __driver_probe_device+0x114/0x190
[ 4.012948] driver_probe_device+0x40/0x100
[ 4.012949] __device_attach_driver+0x98/0x130
[ 4.012950] bus_for_each_drv+0x78/0xd0
[ 4.012951] __device_attach+0xdc/0x1c0
[ 4.012951] device_initial_probe+0x14/0x20
[ 4.012952] bus_probe_device+0x98/0xa0
[ 4.012953] deferred_probe_work_func+0x9c/0xf0
[ 4.012954] process_one_work+0x1cc/0x350
[ 4.012954] worker_thread+0x138/0x46c
[ 4.012955] kthread+0x150/0x160
[ 4.012956] ret_from_fork+0x10/0x18
[ 4.012978] SMP: stopping secondary CPUs
[ 4.012979] Kernel Offset: disabled
[ 4.012979] CPU features: 0x00003051,00000846
[ 4.012980] Memory Limit: none


There are some clocks that make the device to not panic(). Instead, it seems
to disable the access to the disk, causing UFS to fail:

[ 2.701778] ufshcd-hisi ff3c0000.ufs: uic cmd 0x1 with arg3 0x0 completion timeout
[ 2.709352] ufshcd-hisi ff3c0000.ufs: dme-get: attr-id 0x1571 failed 0 retries
[ 2.817775] ufshcd-hisi ff3c0000.ufs: __ufshcd_issue_tm_cmd: task management cmd 0x80 timed-out
[ 3.830445] ------------[ cut here ]------------
[ 3.835068] ufshcd-hisi ff3c0000.ufs: __ufshcd_issue_tm_cmd: unable to clear tm cmd (slot 2) after timeout
[ 3.844736] WARNING: CPU: 6 PID: 140 at drivers/scsi/ufs/ufshcd.c:6575 __ufshcd_issue_tm_cmd+0x2dc/0x2fc
[ 3.854225] Modules linked in:
[ 3.857277] CPU: 6 PID: 140 Comm: kworker/u16:2 Not tainted 5.14.0-rc1+ #725
[ 3.864321] Hardware name: HiKey970 (DT)
[ 3.868238] Workqueue: ufs_eh_wq_0 ufshcd_err_handler
[ 3.873288] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[ 3.879289] pc : __ufshcd_issue_tm_cmd+0x2dc/0x2fc
[ 3.884075] lr : __ufshcd_issue_tm_cmd+0x2dc/0x2fc
[ 3.888858] sp : ffff800012a3bb90
[ 3.892164] x29: ffff800012a3bb90 x28: 00000000000000a0 x27: ffff0001011e2920
[ 3.899297] x26: ffff0001011e2000 x25: 0000000000000000 x24: ffff0001014e0280
[ 3.906429] x23: 0000000000000004 x22: 0000000000000000 x21: 0000000000000002
[ 3.913560] x20: ffff80001115cf98 x19: ffff0001011e2860 x18: 0000000000000000
[ 3.920691] x17: 7261656c63206f74 x16: 20656c62616e7520 x15: 0720072007200720
[ 3.927822] x14: 0720072007200720 x13: ffff0001b87e0000 x12: 0000000000000444
[ 3.934953] x11: 000000000000016c x10: ffff0001bf2386c0 x9 : 00000000ffff0000
[ 3.942083] x8 : ffff0001b87e0000 x7 : ffff0001bf2386c0 x6 : 0000000000000000
[ 3.949215] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 00000000ffffffff
[ 3.956345] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000101071c80
[ 3.963477] Call trace:
[ 3.965916] __ufshcd_issue_tm_cmd+0x2dc/0x2fc
[ 3.970353] ufshcd_issue_tm_cmd+0x74/0xf0
[ 3.974442] ufshcd_try_to_abort_task+0x84/0x21c
[ 3.979053] ufshcd_err_handler+0x4a0/0xad0
[ 3.983229] process_one_work+0x1cc/0x350
[ 3.987236] worker_thread+0x138/0x46c
[ 3.990978] kthread+0x150/0x160
[ 3.994200] ret_from_fork+0x10/0x18
[ 3.997772] ---[ end trace 7cf4829a5cdb528e ]---
[ 4.002392] ufshcd-hisi ff3c0000.ufs: ufshcd_try_to_abort_task: no response from device. tag = 18, err -110
[ 4.045167] ufshcd-hisi ff3c0000.ufs: ufshcd_print_pwr_info:[RX, TX]: gear=[1, 1], lane[1, 1], pwr[SLOWAUTO_MODE, SLOWAUTO_MODE], rate = 0
[ 4.074477] ufshcd-hisi ff3c0000.ufs: ufshcd_print_pwr_info:[RX, TX]: gear=[3, 3], lane[2, 2], pwr[FAST MODE, FAST MODE], rate = 2
[ 5.614829] ufshcd-hisi ff3c0000.ufs: __ufshcd_query_descriptor: opcode 0x01 for idn 8 failed, index 0, err = -11
[ 7.150830] ufshcd-hisi ff3c0000.ufs: __ufshcd_query_descriptor: opcode 0x01 for idn 8 failed, index 0, err = -11
[ 8.686827] ufshcd-hisi ff3c0000.ufs: __ufshcd_query_descriptor: opcode 0x01 for idn 8 failed, index 0, err = -11
[ 8.697091] ufshcd-hisi ff3c0000.ufs: ufshcd_read_desc_param: Failed reading descriptor. desc_id 8, desc_index 0, param_offset 0, ret -11
[ 8.709437] ufshcd-hisi ff3c0000.ufs: ufshcd_set_active_icc_lvl: Failed reading power descriptor.len = 98 ret = -11
[ 8.756444] ufshcd-hisi ff3c0000.ufs: ufshcd_intr: Unhandled interrupt 0x00000000 (0x00000010, 0x00000010)
...
[ 98.230000] ufshcd-hisi ff3c0000.ufs: ufshcd_print_pwr_info:[RX, TX]: gear=[1, 1], lane[1, 1], pwr[SLOWAUTO_MODE, SLOWAUTO_MODE], rate = 0
[ 98.259323] ufshcd-hisi ff3c0000.ufs: ufshcd_print_pwr_info:[RX, TX]: gear=[3, 3], lane[2, 2], pwr[FAST MODE, FAST MODE], rate = 2
[ 99.790788] ufshcd-hisi ff3c0000.ufs: __ufshcd_query_descriptor: opcode 0x01 for idn 8 failed, index 0, err = -11
[ 101.326787] ufshcd-hisi ff3c0000.ufs: __ufshcd_query_descriptor: opcode 0x01 for idn 8 failed, index 0, err = -11
[ 102.862787] ufshcd-hisi ff3c0000.ufs: __ufshcd_query_descriptor: opcode 0x01 for idn 8 failed, index 0, err = -11
[ 102.873051] ufshcd-hisi ff3c0000.ufs: ufshcd_read_desc_param: Failed reading descriptor. desc_id 8, desc_index 0, param_offset 0, ret -11
[ 102.885396] ufshcd-hisi ff3c0000.ufs: ufshcd_set_active_icc_lvl: Failed reading power descriptor.len = 98 ret = -11
[ 102.932393] ufshcd-hisi ff3c0000.ufs: ufshcd_intr: Unhandled interrupt 0x00000000 (0x00000010, 0x00000010)
...

Thanks,
Mauro