Re: [PATCH v2] drm/panfrost:report the full raw fault information instead

From: Robin Murphy
Date: Mon Jun 28 2021 - 10:18:06 EST

Next message: Sasha Levin: "[PATCH 5.12 000/110] 5.12.14-rc1 review"
Previous message: Alan Stern: "Re: [syzbot] INFO: rcu detected stall in tx"
In reply to: Steven Price: "Re: [PATCH v2] drm/panfrost:report the full raw fault information instead"
Next in thread: Chunyou Tang: "Re: [PATCH v2] drm/panfrost:report the full raw fault information instead"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2021-06-28 11:48, Steven Price wrote:

On 25/06/2021 10:49, Chunyou Tang wrote:

Hi Steve,
Thinks for your reply.
When I only set the pte |= ARM_LPAE_PTE_SH_NS;there have no "GPU
Fault",When I set the pte |= ARM_LPAE_PTE_SH_IS(or
ARM_LPAE_PTE_SH_OS);there have "GPU Fault".I don't know how the pte
effect this issue?
Can you give me some suggestions again?

Thinks.

Chunyou

Hi Chunyou,

You haven't given me much context so I'm not entirely sure which PTE you
are talking about (GPU or CPU), or indeed where you are changing the PTE
values.

The PTEs control whether a page is shareable or not, the GPU requires
that accesses are consistent (i.e. either all accesses to a page are
shareable or all are non-shareable) and will race a fault if it detects
this isn't the case. Mali also has a quirk for its version of 'LPAE'
where inner shareable actually means only within the GPU and outer
shareable means outside the GPU (which I think usually means Inner
Shareable on the external bus).

Furthermore, the way the io-pgtable code works for ARM_MALI_LPAE format means that *all* GPU mappings are unconditionally outer-shareable, so it's not clear how the GPU could observe a mismatch in the first place (other than major integration issues causing data corruption).

Robin.

Steve

于 Thu, 24 Jun 2021 14:22:04 +0100
Steven Price <steven.price@xxxxxxx> 写道:

On 22/06/2021 02:40, Chunyou Tang wrote:

Hi Steve,
I will send a new patch with suitable subject/commit
message. But I send a V3 or a new patch?

Send a V3 - it is a new version of this patch.

I met a bug about the GPU,I have no idea about how to fix
it, If you can give me some suggestion,it is perfect.

You can see such kernel log:

Jun 20 10:20:13 icube kernel: [ 774.566760] mvp_gpu 0000:05:00.0:
GPU Fault 0x00000088 (SHAREABILITY_FAULT) at 0x000000000310fd00 Jun
20 10:20:13 icube kernel: [ 774.566764] mvp_gpu 0000:05:00.0:
There were multiple GPU faults - some have not been reported Jun 20
10:20:13 icube kernel: [ 774.667542] mvp_gpu 0000:05:00.0:
AS_ACTIVE bit stuck Jun 20 10:20:13 icube kernel: [ 774.767900]
mvp_gpu 0000:05:00.0: AS_ACTIVE bit stuck Jun 20 10:20:13 icube
kernel: [ 774.868546] mvp_gpu 0000:05:00.0: AS_ACTIVE bit stuck
Jun 20 10:20:13 icube kernel: [ 774.968910] mvp_gpu 0000:05:00.0:
AS_ACTIVE bit stuck Jun 20 10:20:13 icube kernel: [ 775.069251]
mvp_gpu 0000:05:00.0: AS_ACTIVE bit stuck Jun 20 10:20:22 icube
kernel: [ 783.693971] mvp_gpu 0000:05:00.0: gpu sched timeout,
js=1, config=0x7300, status=0x8, head=0x362c900, tail=0x362c100,
sched_job=000000003252fb84

In
https://lore.kernel.org/dri-devel/20200510165538.19720-1-peron.clem@xxxxxxxxx/
there had a same bug like mine,and I found you at the mail list,I
don't know how it fixed?

The GPU_SHAREABILITY_FAULT error means that a cache line has been
accessed both as shareable and non-shareable and therefore coherency
cannot be guaranteed. Although the "multiple GPU faults" means that
this may not be the underlying cause.

The fact that your dmesg log has PCI style identifiers
("0000:05:00.0") suggests this is an unusual platform - I've not
previously been aware of a Mali device behind PCI. Is this device
working with the kbase/DDK proprietary driver? It would be worth
looking at the kbase kernel code for the platform to see if there is
anything special done for the platform.

From the dmesg logs all I can really tell is that the GPU seems
unhappy about the memory system.

Steve

I need your help!

thinks very much!

Chunyou

于 Mon, 21 Jun 2021 11:45:20 +0100
Steven Price <steven.price@xxxxxxx> 写道:

On 19/06/2021 04:18, Chunyou Tang wrote:

Hi Steve,
1,Now I know how to write the subject
2,the low 8 bits is the exception type in spec.

and you can see prnfrost_exception_name()

switch (exception_code) {
/* Non-Fault Status code */
case 0x00: return "NOT_STARTED/IDLE/OK";
case 0x01: return "DONE";
case 0x02: return "INTERRUPTED";
case 0x03: return "STOPPED";
case 0x04: return "TERMINATED";
case 0x08: return "ACTIVE";
........
........
case 0xD8: return "ACCESS_FLAG";
case 0xD9 ... 0xDF: return "ACCESS_FLAG";
case 0xE0 ... 0xE7: return "ADDRESS_SIZE_FAULT";
case 0xE8 ... 0xEF: return "MEMORY_ATTRIBUTES_FAULT";
}
return "UNKNOWN";
}

the exception_code in case is only 8 bits,so if fault_status
in panfrost_gpu_irq_handler() don't & 0xFF,it can't get correct
exception reason,it will be always UNKNOWN.

Yes, I'm happy with the change - I just need a patch that I can
apply. At the moment this patch only changes the first '0x%08x'
output rather than the call to panfrost_exception_name() as well.
So we just need a patch which does:

- fault_status & 0xFF, panfrost_exception_name(pfdev,
fault_status),
+ fault_status, panfrost_exception_name(pfdev, fault_status &
0xFF),

along with a suitable subject/commit message describing the
change. If you can send me that I can apply it.

Thanks,

Steve

PS. Sorry for going round in circles here - I'm trying to help you
get setup so you'll be able to contribute patches easily in
future. An important part of that is ensuring you can send a
properly formatted patch to the list.

PPS. I'm still not receiving your emails directly. I don't think
it's a problem at my end because I'm receiving other emails, but
if you can somehow fix the problem you're likely to receive a
faster response.

于 Fri, 18 Jun 2021 13:43:24 +0100
Steven Price <steven.price@xxxxxxx> 写道:

On 17/06/2021 07:20, ChunyouTang wrote:

From: ChunyouTang <tangchunyou@xxxxxxxxxxxx>

of the low 8 bits.

Please don't split the subject like this. The first line of the
commit should be a (very short) summary of the patch. Then a
blank line and then a longer description of what the purpose of
the patch is and why it's needed.

Also you previously had this as part of a series (the first part
adding the "& 0xFF" in the panfrost_exception_name() call). I'm
not sure we need two patches for the single line, but as it
stands this patch doesn't apply.

Also I'm still not receiving any emails from you directly (only
via the list), so it's possible I might have missed something
you sent.

Steve

Signed-off-by: ChunyouTang <tangchunyou@xxxxxxxxxxxx>
---
drivers/gpu/drm/panfrost/panfrost_gpu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c
b/drivers/gpu/drm/panfrost/panfrost_gpu.c index
1fffb6a0b24f..d2d287bbf4e7 100644 ---
a/drivers/gpu/drm/panfrost/panfrost_gpu.c +++
b/drivers/gpu/drm/panfrost/panfrost_gpu.c @@ -33,7 +33,7 @@
static irqreturn_t panfrost_gpu_irq_handler(int irq, void
*data) address |= gpu_read(pfdev, GPU_FAULT_ADDRESS_LO);
dev_warn(pfdev->dev, "GPU Fault 0x%08x (%s) at
0x%016llx\n",
- fault_status & 0xFF,
panfrost_exception_name(pfdev, fault_status & 0xFF),
+ fault_status,
panfrost_exception_name(pfdev, fault_status & 0xFF), address);
if (state & GPU_IRQ_MULTIPLE_FAULT)

Next message: Sasha Levin: "[PATCH 5.12 000/110] 5.12.14-rc1 review"
Previous message: Alan Stern: "Re: [syzbot] INFO: rcu detected stall in tx"
In reply to: Steven Price: "Re: [PATCH v2] drm/panfrost:report the full raw fault information instead"
Next in thread: Chunyou Tang: "Re: [PATCH v2] drm/panfrost:report the full raw fault information instead"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]