Re: Crashes under Xen with Radeon graphics card

From: Juergen Gross
Date: Fri Dec 15 2023 - 11:33:38 EST


On 15.12.23 17:19, Deucher, Alexander wrote:
[AMD Official Use Only - General]

-----Original Message-----
From: Juergen Gross <jgross@xxxxxxxx>
Sent: Friday, December 15, 2023 11:13 AM
To: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; lkml <linux-kernel@xxxxxxxxxxxxxxx>;
xen-devel@xxxxxxxxxxxxxxxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>; Pan, Xinhui
<Xinhui.Pan@xxxxxxx>
Subject: Re: Crashes under Xen with Radeon graphics card

On 15.12.23 17:04, Deucher, Alexander wrote:
[Public]

-----Original Message-----
From: Juergen Gross <jgross@xxxxxxxx>

...

The crashes vary, but often the kernel accesses non-canonical
addresses or tries to map illegal physical addresses. Sometimes the
system is just hanging, either with softlockups or without any further signs
of being alive.

I can easily reproduce the problem, so any debug patches to narrow
down the problem are welcome.

Some firmware required for proper operation is still missing. Please fix
that up.

That was the starting point, of course!

Ah, ok. Thanks for clarifying. What exactly happens when you get this crash? System hang? Kernel oops? Is there anything in the dmesg when it happens?

As I wrote above: the cases differ quite a bit. The crash normally happens
within 20 seconds after the system is completely up. I had one case
where it survived for about 2 minutes.

One example:

[ 64.549114] BUG: unable to handle page fault for address: ffff888121291000
[ 64.562850] #PF: supervisor write access in kernel mode
[ 64.573352] #PF: error_code(0x0003) - permissions violation
[ 64.584589] PGD 2836067 P4D 2836067 PUD 3e73f7067 PMD 3e72ed067 PTE 8010000121291025
[ 64.600212] Oops: 0003 [#1] PREEMPT SMP NOPTI
[ 64.608985] CPU: 3 PID: 2090 Comm: kioslave5 Tainted: G E 6.7.0-rc5-default #974
[ 64.626721] Hardware name: Dell Inc. OptiPlex 9020/0PC5F7, BIOS A25 05/30/2019
[ 64.641193] RIP: e030:clear_page_erms+0x7/0x10
[ 64.650161] Code: 48 89 47 38 48 8d 7f 40 75 d9 90 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[ 64.687996] RSP: e02b:ffffc9004206fb50 EFLAGS: 00010246
[ 64.698378] RAX: 0000000000000000 RBX: ffffea000484a400 RCX: 0000000000001000
[ 64.712780] RDX: 0000000000052dc0 RSI: 0000000000000003 RDI: ffff888121291000
[ 64.727154] RBP: 0000000000000901 R08: ffffea000484a440 R09: ffffea000484a600
[ 64.741491] R10: 0000000000000002 R11: 000000000000241e R12: ffff8883e7d21d80
[ 64.755843] R13: 000000000028d834 R14: 0000000000000901 R15: ffffea000484a400
[ 64.770207] FS: 00007f4c2b79d280(0000) GS:ffff888409380000(0000) knlGS:0000000000000000
[ 64.786487] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 64.798019] CR2: ffff888121291000 CR3: 000000014fef4000 CR4: 0000000000050660
[ 64.812411] Call Trace:
[ 64.817308] <TASK>
[ 64.821625] ? __die_body+0x1a/0x60
[ 64.828746] ? page_fault_oops+0x151/0x470
[ 64.837065] ? search_bpf_extables+0x65/0x70
[ 64.845717] ? fixup_exception+0x22/0x320
[ 64.853844] ? exc_page_fault+0xb3/0x150
[ 64.861792] ? asm_exc_page_fault+0x22/0x30
[ 64.870275] ? clear_page_erms+0x7/0x10
[ 64.878050] prep_new_page+0x97/0xb0
[ 64.885308] get_page_from_freelist+0x7a4/0x1f40
[ 64.894678] __alloc_pages+0x18b/0x350
[ 64.902270] ? kvmalloc_node+0x3a/0xd0
[ 64.909892] __kmalloc_large_node+0x7a/0x140
[ 64.918542] __kmalloc_node+0xc1/0x130
[ 64.926149] kvmalloc_node+0x3a/0xd0
[ 64.933399] proc_sys_call_handler+0xfa/0x230
[ 64.942259] vfs_read+0x22f/0x2e0
[ 64.949007] ksys_read+0xa5/0xe0
[ 64.955527] do_syscall_64+0x5d/0xe0
[ 64.962806] ? do_user_addr_fault+0x5b3/0x8a0
[ 64.971647] ? exc_page_fault+0x6f/0x150
[ 64.979587] entry_SYSCALL_64_after_hwframe+0x6f/0x77
[ 64.989821] RIP: 0033:0x7f4c29f06a3e
[ 64.997098] Code: 08 e8 f4 1e 02 00 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a f3 c3 0f 1f 84 00 00 00 00 00 41 54 55 49
[ 65.034962] RSP: 002b:00007ffd5a86f2b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 65.050071] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4c29f06a3e
[ 65.064415] RDX: 0000000000004000 RSI: 0000000002562c18 RDI: 0000000000000004
[ 65.078775] RBP: 0000000002561d60 R08: 00007f4c2abd3418 R09: 0000000000000028
[ 65.093155] R10: 000000000253b010 R11: 0000000000000246 R12: 0000000000004000
[ 65.107492] R13: 0000000000004000 R14: 0000000000000004 R15: 0000000002562c18
[ 65.121850] </TASK>
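
For readers who want to decode the fault details in the trace above, here is a
minimal sketch in Python. It is not part of the original report. It assumes
the architectural x86-64 bit layout for the page-fault error code and for the
low PTE flag bits; the function names are hypothetical and chosen only for
illustration.

```python
# Hedged sketch: decode the page-fault error code and PTE flag bits that
# appear in the oops. Bit positions are the architectural x86-64 ones;
# the helper names below are illustrative, not kernel APIs.

def decode_pf_error(code: int) -> dict:
    """Split an x86 page-fault error code into its main flags."""
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "protection_violation": bool(code & (1 << 0)),
        # bit 1: 0 = read access, 1 = write access
        "write": bool(code & (1 << 1)),
        # bit 2: 0 = supervisor-mode access, 1 = user-mode access
        "user_mode": bool(code & (1 << 2)),
        # bit 3: reserved bit set in a paging-structure entry
        "reserved_bit": bool(code & (1 << 3)),
        # bit 4: fault caused by an instruction fetch
        "instruction_fetch": bool(code & (1 << 4)),
    }


def decode_pte_flags(pte: int) -> dict:
    """Extract commonly inspected flag bits from a 64-bit PTE value."""
    return {
        "present": bool(pte & (1 << 0)),
        "writable": bool(pte & (1 << 1)),
        "user": bool(pte & (1 << 2)),
        "accessed": bool(pte & (1 << 5)),
        "dirty": bool(pte & (1 << 6)),
        "no_execute": bool(pte & (1 << 63)),
    }


if __name__ == "__main__":
    # Values copied from the "error_code(0x0003)" and "PTE ..." lines.
    print(decode_pf_error(0x0003))
    print(decode_pte_flags(0x8010000121291025))
```

Under these assumptions the two values agree: the error code describes a
supervisor write to a present page, and the PTE has the present bit set but
the writable bit clear, i.e. clear_page_erms() wrote to a page that is mapped
read-only.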



BTW, meanwhile I have tested kernel 5.19, which is working. I suspected that
the patch series merging swiotlb and swiotlb-xen could be to blame, but that
already went into v5.19, so it is probably not the culprit.

Can you bisect?

I can try to find the offending commit, sure. I just wanted to share my current
findings in the hope that someone might have an idea ...


Juergen
