Re: [PATCH 2/2] drivers: edac: Add EDAC support for Kryo CPU caches

From: Sai Prakash Ranjan
Date: Fri Jan 24 2020 - 09:52:18 EST


Hi James,

On 2020-01-16 00:19, James Morse wrote:
Hi guys,

(CC: +Tyler)

On 13/01/2020 05:44, Sai Prakash Ranjan wrote:
On 2019-12-30 17:20, Borislav Petkov wrote:
On Thu, Dec 05, 2019 at 09:53:18AM +0000, Sai Prakash Ranjan wrote:
Kryo{3,4}XX CPU cores implement RAS extensions to support
Error Correcting Code(ECC). Currently all Kryo{3,4}XX CPU
cores (gold/silver a.k.a big/LITTLE) support ECC via RAS.

via RAS what? ARM64_RAS_EXTN?

In any case, this needs James to look at and especially if there's some
ARM-generic functionality in there which should be shared, of course.

Yes it is ARM64_RAS_EXTN and I have been hoping if James can provide the feedback,
it has been some time now since I posted this out.

Sorry, I was out of the office for most of November/December, and I'm
slowly catching up...


+
+config EDAC_QCOM_KRYO_POLL
+ÂÂÂ depends on EDAC_QCOM_KRYO
+ÂÂÂ bool "Poll on Kryo ECC registers"
+ÂÂÂ help
+ÂÂÂÂÂ This option chooses whether or not you want to poll on the Kryo ECC
+ÂÂÂÂÂ registers. When this is enabled, the polling rate can be set as a
+ÂÂÂÂÂ module parameter. By default, it will call the polling function every
+ÂÂÂÂÂ second.

Why is this a separate option and why should people use that?

Can the polling/irq be switched automatically?

No it cannot be switched automatically. It is used in case some SoCs do not support an irq
based mechanism for EDAC.
But I am contradicting myself because I am telling that atleast one interrupt should be
specified in bindings,
so it is best if I drop this polling option for now.

For now, sure. But I think this will come back for systems with
embarrassing amounts of
RAM that would rather scrub the errors than take a flood of IRQs. I'd
like this to be
controllable from user-space.


Ok so we should have an option to switch between polling and irq.


diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index d77200c9680b..29edcfa6ec0e 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -85,5 +85,6 @@ obj-$(CONFIG_EDAC_SYNOPSYS)ÂÂÂÂÂÂÂ += synopsys_edac.o
Âobj-$(CONFIG_EDAC_XGENE)ÂÂÂÂÂÂÂ += xgene_edac.o
Âobj-$(CONFIG_EDAC_TI)ÂÂÂÂÂÂÂÂÂÂÂ += ti_edac.o
Âobj-$(CONFIG_EDAC_QCOM)ÂÂÂÂÂÂÂÂÂÂÂ += qcom_edac.o
+obj-$(CONFIG_EDAC_QCOM_KRYO)ÂÂÂÂÂÂÂ += qcom_kryo_edac.o

What is the difference between this new driver and the qcom_edac one? Can
functionality be shared?

High-level story time:
Until the 'v8.2' revision of the 'v8' Arm-architecture (the 64bit
one), arm didn't
describe how RAS should work. Partners implemented what they needed,
and we ended up with
this collection of drivers because they were all different.

v8.2 fixed all this, the good news is once its done, we should never
need another edac
driver. (at least, not for SoCs built for v8.2). The downside is there
is quite a lot in
there, and we need to cover ACPI machines as well as DT.


That is true but the qcom_edac one which is merged is for LLC(system cache) which is a QCOM IP.

qcom_edac driver is for QCOM system cache(last level cache), it should be renamed to
qcom_llcc_edac.c.
This new driver is for QCOM Kryo CPU core caches(L1,L2,L3).

Functionality cannot be shared as these two are different IP blocks and best kept separate.

The qcom_edac will be Qualcomm's pre-v8.2 support.
This series is about the v8.2 support which all looks totally
different to Linux.


As said before qcom_edac is for LLC which is not available on all SoCs.
QCOM's pre v8.2 support is not upstreamed.


+ * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.3

Chapter? Where? URL?


I chose this because these TRMs are openly available and if you search for these above
terms like
"Cortex-A76 TRM Chapter B3.3" in google, then the first search result will be the TRM pdf,
otherwise
I would have to specify the long URL for the pdf and we do not know how long that URL link
will be active.

These are SoC/CPU specific. Using these we can't solve the whole problem.

The architecture all those should fit into is here:
https://static.docs.arm.com/ddi0587/cb/2019_07_05_DD_0587_C_b.pdf
(or https://developer.arm.com/docs/ and look for 'RAS')

... and the arm-arm.


Thanks for the link.


+static void dump_syndrome_reg(int error_type, int level,
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ u64 errxstatus, u64 errxmisc,
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ struct edac_device_ctl_info *edev_ctl)
+{
+ÂÂÂ char msg[KRYO_EDAC_MSG_MAX];
+ÂÂÂ const char *error_msg;
+ÂÂÂ int cpu;
+
+ÂÂÂ cpu = raw_smp_processor_id();

Why raw_?


Because we will be calling smp_processor_id in preemptible context and if we enable
CONFIG_DEBUG_PREEMPT,
we would get a nice backtrace.

[ÂÂÂ 3.747468] BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
[ÂÂÂ 3.755527] caller is qcom_kryo_edac_probe+0x138/0x2b8
[ÂÂÂ 3.760819] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G SÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ
5.4.0-rc7-next-20191113-00009-g8666855d6a5b-dirty #107
[ÂÂÂ 3.772323] Hardware name: Qualcomm Technologies, Inc. SM8150 MTP (DT)
[ÂÂÂ 3.779030] Call trace:
[ÂÂÂ 3.781556]Â dump_backtrace+0x0/0x158
[ÂÂÂ 3.785331]Â show_stack+0x14/0x20
[ÂÂÂ 3.788741]Â dump_stack+0xb0/0xf4
[ÂÂÂ 3.792164]Â debug_smp_processor_id+0xd8/0xe0
[ÂÂÂ 3.796639]Â qcom_kryo_edac_probe+0x138/0x2b8
[ÂÂÂ 3.801116]Â platform_drv_probe+0x50/0xa8
[ÂÂÂ 3.805236]Â really_probe+0x108/0x360
[ÂÂÂ 3.808999]Â driver_probe_device+0x58/0x100
[ÂÂÂ 3.813304]Â device_driver_attach+0x6c/0x78
[ÂÂÂ 3.817606]Â __driver_attach+0xb0/0xf0
[ÂÂÂ 3.821459]Â bus_for_each_dev+0x68/0xc8
[ÂÂÂ 3.825407]Â driver_attach+0x20/0x28
[ÂÂÂ 3.829083]Â bus_add_driver+0x160/0x1f0
[ÂÂÂ 3.833030]Â driver_register+0x60/0x110
[ÂÂÂ 3.836976]Â __platform_driver_register+0x40/0x48
[ÂÂÂ 3.841813]Â qcom_kryo_edac_driver_init+0x18/0x20
[ÂÂÂ 3.846645]Â do_one_initcall+0x58/0x1a0
[ÂÂÂ 3.850596]Â kernel_init_freeable+0x19c/0x240
[ÂÂÂ 3.855075]Â kernel_init+0x10/0x108
[ÂÂÂ 3.858665]Â ret_from_fork+0x10/0x1c

and raw_ stops the backtrace? You are still preemptible. The problem
still exists, you've
just suppressed the warning.

At any time in dump_syndrome_reg(), you could get an interrupt and
another task gets
scheduled. Later your thread is started on another cpu... but not the
one whose cpu number
you read from smp_processor_id(). Whatever you needed it for, might
have the wrong value.


Ok will correct this.


+static int kryo_l1_l2_setup_irq(struct platform_device *pdev,
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ struct edac_device_ctl_info *edev_ctl)
+{
+ÂÂÂ int cpu, errirq, faultirq, ret;
+
+ÂÂÂ edac_dev = devm_alloc_percpu(&pdev->dev, *edac_dev);
+ÂÂÂ if (!edac_dev)
+ÂÂÂÂÂÂÂ return -ENOMEM;
+
+ÂÂÂ for_each_possible_cpu(cpu) {
+ÂÂÂÂÂÂÂ preempt_disable();
+ÂÂÂÂÂÂÂ per_cpu(edac_dev, cpu) = edev_ctl;
+ÂÂÂÂÂÂÂ preempt_enable();
+ÂÂÂ }

That sillyness doesn't belong here, if at all.

Sorry but I do not understand the sillyness here. Could you please explain?

preempt_disable() prevents another task being scheduled instead of
you, avoiding the risk
that you get scheduled on another cpu. In this case it doesn't matter
which cpu you are
running on as you aren't accessing _this_ cpu's edac_dev, you are
accessing each one in a
loop.


Thanks for the explanation James, now I get the sillyness.

Thanks,
Sai

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation