Re: [Intel-wired-lan] bug with rx-udp-gro-forwarding offloading?

From: Ian Kumlien
Date: Mon Jun 26 2023 - 14:20:26 EST


Never mind, I think I found it. I will loop this until I have a
proper trace...

On Mon, Jun 26, 2023 at 8:01 PM Ian Kumlien <ian.kumlien@xxxxxxxxx> wrote:
>
> On Mon, Jun 26, 2023 at 7:56 PM Paolo Abeni <pabeni@xxxxxxxxxx> wrote:
> >
> > On Mon, 2023-06-26 at 19:30 +0200, Ian Kumlien wrote:
> > > There, that didn't take long, even with wireguard disabled
> > >
> > > [14079.678380] BUG: kernel NULL pointer dereference, address: 00000000000000c0
> > > [14079.685456] #PF: supervisor read access in kernel mode
> > > [14079.690686] #PF: error_code(0x0000) - not-present page
> > > [14079.695915] PGD 0 P4D 0
> > > [14079.698540] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > > [14079.702996] CPU: 11 PID: 891 Comm: napi/eno2-80 Not tainted 6.4.0 #360
> > > [14079.709614] Hardware name: Supermicro Super Server/A2SDi-12C-HLN4F, BIOS 1.7a 10/13/2022
> > > [14079.717796] RIP: 0010:__udp_gso_segment+0x346/0x4f0
> > > [14079.722778] Code: c3 08 66 89 5c 02 04 45 84 e4 0f 85 27 fd ff ff
> > > 49 8b 1e 49 8b ae c0 00 00 00 41 0f b7 86 b4 00 00 00 45 0f b7 a6 b2
> > > 00 00 00 <48> 8b b3 c0 00 00 00 0f b7 8b b2 00 00 00 49 01 ec 48 01 c5
> > > 48 8d
> > > [14079.741645] RSP: 0018:ffffa83643a4f818 EFLAGS: 00010246
> > > [14079.746966] RAX: 00000000000000ce RBX: 0000000000000000 RCX: 0000000000000000
> > > [14079.754195] RDX: ffffa2ad1403b000 RSI: 0000000000000028 RDI: ffffa2afc9d302d4
> > > [14079.761422] RBP: ffffa2ad1403b000 R08: 0000000000000022 R09: 00002000001558c9
> > > [14079.768650] R10: 0000000000000000 R11: ffffa2b02fcea888 R12: 00000000000000e2
> > > [14079.775879] R13: ffffa2afc9d30200 R14: ffffa2afc9d30200 R15: 00002000001558c9
> > > [14079.783106] FS: 0000000000000000(0000) GS:ffffa2b02fcc0000(0000) knlGS:0000000000000000
> > > [14079.791305] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [14079.797162] CR2: 00000000000000c0 CR3: 0000000151ff4000 CR4: 00000000003526e0
> > > [14079.804408] Call Trace:
> > > [14079.806961] <TASK>
> > > [14079.809170] ? __die+0x1a/0x60
> > > [14079.812340] ? page_fault_oops+0x158/0x440
> > > [14079.816551] ? ip6_route_output_flags+0xe3/0x160
> > > [14079.821284] ? exc_page_fault+0x3f4/0x820
> > > [14079.825408] ? update_load_avg+0x77/0x710
> > > [14079.829534] ? asm_exc_page_fault+0x22/0x30
> > > [14079.833836] ? __udp_gso_segment+0x346/0x4f0
> > > [14079.838218] ? __udp_gso_segment+0x2fa/0x4f0
> > > [14079.842600] ? _raw_spin_unlock_irqrestore+0x16/0x30
> > > [14079.847679] ? try_to_wake_up+0x8e/0x5a0
> > > [14079.851713] inet_gso_segment+0x150/0x3c0
> > > [14079.855827] ? vhost_poll_wakeup+0x31/0x40
> > > [14079.860032] skb_mac_gso_segment+0x9b/0x110
> > > [14079.864331] __skb_gso_segment+0xae/0x160
> > > [14079.868455] ? netif_skb_features+0x144/0x290
> > > [14079.872928] validate_xmit_skb+0x167/0x370
> > > [14079.877139] validate_xmit_skb_list+0x43/0x70
> > > [14079.881612] sch_direct_xmit+0x267/0x380
> > > [14079.885641] __qdisc_run+0x140/0x590
> > > [14079.889324] __dev_queue_xmit+0x44d/0xba0
> > > [14079.893450] ? nf_hook_slow+0x3c/0xb0
> > > [14079.897229] br_dev_queue_push_xmit+0xb2/0x1c0
> > > [14079.901788] maybe_deliver+0xa9/0x100
> > > [14079.905564] br_flood+0x8a/0x180
> > > [14079.908903] br_handle_frame_finish+0x31f/0x5b0
> > > [14079.913547] br_handle_frame+0x28f/0x3a0
> > > [14079.917585] ? ipv6_find_hdr+0x1f0/0x3e0
> > > [14079.921622] ? br_handle_local_finish+0x20/0x20
> > > [14079.926267] __netif_receive_skb_core.constprop.0+0x4c5/0xc90
> > > [14079.932125] ? br_handle_frame_finish+0x5b0/0x5b0
> > > [14079.936946] ? ___slab_alloc+0x4bf/0xaf0
> > > [14079.940986] __netif_receive_skb_list_core+0x107/0x250
> > > [14079.946240] netif_receive_skb_list_internal+0x194/0x2b0
> > > [14079.951660] ? napi_gro_flush+0x97/0xf0
> > > [14079.955604] napi_complete_done+0x69/0x180
> > > [14079.959808] ixgbe_poll+0xe10/0x12e0
> > > [14079.963506] __napi_poll+0x26/0x1b0
> > > [14079.967106] napi_threaded_poll+0x232/0x250
> > > [14079.971405] ? __napi_poll+0x1b0/0x1b0
> > > [14079.975260] kthread+0xee/0x120
> > > [14079.978510] ? kthread_complete_and_exit+0x20/0x20
> > > [14079.983415] ret_from_fork+0x22/0x30
> > > [14079.987102] </TASK>
> > > [14079.989395] Modules linked in: chaoskey
> > > [14079.993347] CR2: 00000000000000c0
> > > [14079.996773] ---[ end trace 0000000000000000 ]---
> > > [14080.018013] pstore: backend (erst) writing error (-28)
> > > [14080.023274] RIP: 0010:__udp_gso_segment+0x346/0x4f0
> > > [14080.028264] Code: c3 08 66 89 5c 02 04 45 84 e4 0f 85 27 fd ff ff
> > > 49 8b 1e 49 8b ae c0 00 00 00 41 0f b7 86 b4 00 00 00 45 0f b7 a6 b2
> > > 00 00 00 <48> 8b b3 c0 00 00 00 0f b7 8b b2 00 00 00 49 01 ec 48 01 c5
> > > 48 8d
> > > [14080.047181] RSP: 0018:ffffa83643a4f818 EFLAGS: 00010246
> > > [14080.052522] RAX: 00000000000000ce RBX: 0000000000000000 RCX: 0000000000000000
> > > [14080.059765] RDX: ffffa2ad1403b000 RSI: 0000000000000028 RDI: ffffa2afc9d302d4
> > > [14080.067012] RBP: ffffa2ad1403b000 R08: 0000000000000022 R09: 00002000001558c9
> > > [14080.074257] R10: 0000000000000000 R11: ffffa2b02fcea888 R12: 00000000000000e2
> > > [14080.081502] R13: ffffa2afc9d30200 R14: ffffa2afc9d30200 R15: 00002000001558c9
> > > [14080.088746] FS: 0000000000000000(0000) GS:ffffa2b02fcc0000(0000) knlGS:0000000000000000
> > > [14080.096964] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [14080.102823] CR2: 00000000000000c0 CR3: 0000000151ff4000 CR4: 00000000003526e0
> > > [14080.110067] Kernel panic - not syncing: Fatal exception in interrupt
> > > [14080.325501] Kernel Offset: 0x12600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > > [14080.353129] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
> >
> > Could you please provide a decoded stack trace?
> >
> > # in your git tree:
> > cat <stacktrace file> | ./scripts/decode_stacktrace.sh vmlinux
>
> I'm afraid it doesn't yield much more information, and I can't say why.
>
> cat bug.txt | ./scripts/decode_stacktrace.sh vmlinux
> [14079.678380] BUG: kernel NULL pointer dereference, address: 00000000000000c0
> [14079.685456] #PF: supervisor read access in kernel mode
> [14079.690686] #PF: error_code(0x0000) - not-present page
> [14079.695915] PGD 0 P4D 0
> [14079.698540] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [14079.702996] CPU: 11 PID: 891 Comm: napi/eno2-80 Not tainted 6.4.0 #360
> [14079.709614] Hardware name: Supermicro Super Server/A2SDi-12C-HLN4F, BIOS 1.7a 10/13/2022
> [14079.717796] RIP: 0010:__udp_gso_segment (??:?)
> [14079.722778] Code: c3 08 66 89 5c 02 04 45 84 e4 0f 85 27 fd ff ff
>
> Code starting with the faulting instruction
> ===========================================
> 0: c3 ret
> 1: 08 66 89 or %ah,-0x77(%rsi)
> 4: 5c pop %rsp
> 5: 02 04 45 84 e4 0f 85 add -0x7af01b7c(,%rax,2),%al
> c: 27 (bad)
> d: fd std
> e: ff (bad)
> f: ff .byte 0xff
> 49 8b 1e 49 8b ae c0 00 00 00 41 0f b7 86 b4 00 00 00 45 0f b7 a6 b2
> 00 00 00 <48> 8b b3 c0 00 00 00 0f b7 8b b2 00 00 00 49 01 ec 48 01 c5
> 48 8d
> [14079.741645] RSP: 0018:ffffa83643a4f818 EFLAGS: 00010246
> [14079.746966] RAX: 00000000000000ce RBX: 0000000000000000 RCX: 0000000000000000
> [14079.754195] RDX: ffffa2ad1403b000 RSI: 0000000000000028 RDI: ffffa2afc9d302d4
> [14079.761422] RBP: ffffa2ad1403b000 R08: 0000000000000022 R09: 00002000001558c9
> [14079.768650] R10: 0000000000000000 R11: ffffa2b02fcea888 R12: 00000000000000e2
> [14079.775879] R13: ffffa2afc9d30200 R14: ffffa2afc9d30200 R15: 00002000001558c9
> [14079.783106] FS: 0000000000000000(0000) GS:ffffa2b02fcc0000(0000) knlGS:0000000000000000
> [14079.791305] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [14079.797162] CR2: 00000000000000c0 CR3: 0000000151ff4000 CR4: 00000000003526e0
> [14079.804408] Call Trace:
> [14079.806961] <TASK>
> [14079.809170] ? __die (??:?)
> [14079.812340] ? page_fault_oops (fault.c:?)
> [14079.816551] ? ip6_route_output_flags (??:?)
> [14079.821284] ? exc_page_fault (??:?)
> [14079.825408] ? update_load_avg (fair.c:?)
> [14079.829534] ? asm_exc_page_fault (??:?)
> [14079.833836] ? __udp_gso_segment (??:?)
> [14079.838218] ? __udp_gso_segment (??:?)
> [14079.842600] ? _raw_spin_unlock_irqrestore (??:?)
> [14079.847679] ? try_to_wake_up (core.c:?)
> [14079.851713] inet_gso_segment (??:?)
> [14079.855827] ? vhost_poll_wakeup (vhost.c:?)
> [14079.860032] skb_mac_gso_segment (??:?)
> [14079.864331] __skb_gso_segment (??:?)
> [14079.868455] ? netif_skb_features (??:?)
> [14079.872928] validate_xmit_skb (dev.c:?)
> [14079.877139] validate_xmit_skb_list (??:?)
> [14079.881612] sch_direct_xmit (??:?)
> [14079.885641] __qdisc_run (??:?)
> [14079.889324] __dev_queue_xmit (??:?)
> [14079.893450] ? nf_hook_slow (??:?)
> [14079.897229] br_dev_queue_push_xmit (??:?)
> [14079.901788] maybe_deliver (br_forward.c:?)
> [14079.905564] br_flood (??:?)
> [14079.908903] br_handle_frame_finish (??:?)
> [14079.913547] br_handle_frame (br_input.c:?)
> [14079.917585] ? ipv6_find_hdr (??:?)
> [14079.921622] ? br_handle_local_finish (??:?)
> [14079.926267] __netif_receive_skb_core.constprop.0 (dev.c:?)
> [14079.932125] ? br_handle_frame_finish (br_input.c:?)
> [14079.936946] ? ___slab_alloc (slub.c:?)
> [14079.940986] __netif_receive_skb_list_core (dev.c:?)
> [14079.946240] netif_receive_skb_list_internal (??:?)
> [14079.951660] ? napi_gro_flush (??:?)
> [14079.955604] napi_complete_done (??:?)
> [14079.959808] ixgbe_poll (??:?)
> [14079.963506] __napi_poll (dev.c:?)
> [14079.967106] napi_threaded_poll (dev.c:?)
> [14079.971405] ? __napi_poll (dev.c:?)
> [14079.975260] kthread (kthread.c:?)
> [14079.978510] ? kthread_complete_and_exit (kthread.c:?)
> [14079.983415] ret_from_fork (??:?)
> [14079.987102] </TASK>
> [14079.989395] Modules linked in: chaoskey
> [14079.993347] CR2: 00000000000000c0
> [14079.996773] ---[ end trace 0000000000000000 ]---
> [14080.018013] pstore: backend (erst) writing error (-28)
> [14080.023274] RIP: 0010:__udp_gso_segment (??:?)
> [14080.028264] Code: c3 08 66 89 5c 02 04 45 84 e4 0f 85 27 fd ff ff
>
> Code starting with the faulting instruction
> ===========================================
> 0: c3 ret
> 1: 08 66 89 or %ah,-0x77(%rsi)
> 4: 5c pop %rsp
> 5: 02 04 45 84 e4 0f 85 add -0x7af01b7c(,%rax,2),%al
> c: 27 (bad)
> d: fd std
> e: ff (bad)
> f: ff .byte 0xff
> 49 8b 1e 49 8b ae c0 00 00 00 41 0f b7 86 b4 00 00 00 45 0f b7 a6 b2
> 00 00 00 <48> 8b b3 c0 00 00 00 0f b7 8b b2 00 00 00 49 01 ec 48 01 c5
> 48 8d
> [14080.047181] RSP: 0018:ffffa83643a4f818 EFLAGS: 00010246
> [14080.052522] RAX: 00000000000000ce RBX: 0000000000000000 RCX: 0000000000000000
> [14080.059765] RDX: ffffa2ad1403b000 RSI: 0000000000000028 RDI: ffffa2afc9d302d4
> [14080.067012] RBP: ffffa2ad1403b000 R08: 0000000000000022 R09: 00002000001558c9
> [14080.074257] R10: 0000000000000000 R11: ffffa2b02fcea888 R12: 00000000000000e2
> [14080.081502] R13: ffffa2afc9d30200 R14: ffffa2afc9d30200 R15: 00002000001558c9
> [14080.088746] FS: 0000000000000000(0000) GS:ffffa2b02fcc0000(0000) knlGS:0000000000000000
> [14080.096964] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [14080.102823] CR2: 00000000000000c0 CR3: 0000000151ff4000 CR4: 00000000003526e0
> [14080.110067] Kernel panic - not syncing: Fatal exception in interrupt
> [14080.325501] Kernel Offset: 0x12600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [14080.353129] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
>
> The binaries aren't stripped, so I don't currently know why it's like this...
>
> But I also get:
> gdb vmlinux
> GNU gdb (Gentoo 13.2 vanilla) 13.2
> Reading symbols from vmlinux...
> (No debugging symbols found in vmlinux)
> Traceback (most recent call last):
> File "/usr/src/linux/vmlinux-gdb.py", line 25, in <module>
> import linux.constants
> File "/usr/src/linux/scripts/gdb/linux/constants.py", line 10, in <module>
> LX_hrtimer_resolution = gdb.parse_and_eval("hrtimer_resolution")
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> gdb.error: 'hrtimer_resolution' has unknown type; cast it to its declared type
> ---
>
> > Thanks!
> >
> > Paolo
> >
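
For reference, the faulting instruction can be recovered by hand from the "Code:" line of the original Oops. In the decoded output above, the "Code starting with the faulting instruction" listing (ret, (bad), std, ...) appears unreliable, possibly because the Code: line was wrapped and the `<48>` marker, which marks the byte at RIP, ended up on a continuation line the tool did not read. The Python sketch below is an illustration only, not a substitute for scripts/decodecode or objdump: it re-joins the wrapped bytes, takes the bytes from the `<48>` marker onward, and decodes just the one encoding seen there (REX.W `8B /r` with a 32-bit displacement). RBX and CR2 are copied from the register dump; the helper names `faulting_bytes` and `decode_mov_load` are hypothetical. It shows the load is `mov 0xc0(%rbx),%rsi`, and with RBX = 0 the address read is 0xc0, which matches CR2 and the reported NULL pointer dereference address.

```python
# Illustrative sketch: hand-decode the faulting instruction from the Oops.
# Assumption: only REX.W + 8B /r with ModRM mod=10 ([base + disp32]) is
# handled; anything else needs a real disassembler.

# "Code:" bytes from the Oops, re-joined from the email-wrapped lines.
CODE = (
    "c3 08 66 89 5c 02 04 45 84 e4 0f 85 27 fd ff ff "
    "49 8b 1e 49 8b ae c0 00 00 00 41 0f b7 86 b4 00 00 00 45 0f b7 a6 b2 "
    "00 00 00 <48> 8b b3 c0 00 00 00 0f b7 8b b2 00 00 00 49 01 ec 48 01 c5 "
    "48 8d"
)

# Values copied from the register dump (only what this decode needs).
REGS = {"rbx": 0x0}
CR2 = 0xC0

# 64-bit register names indexed by their 4-bit encoding (REX bit + 3 bits).
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def faulting_bytes(code: str) -> list[int]:
    """Return the bytes starting at the <..> marker, i.e. at RIP."""
    tokens = code.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return [int(t.strip("<>"), 16) for t in tokens[start:]]


def decode_mov_load(b: list[int]) -> tuple[str, str, int]:
    """Decode REX.W 8B /r, mod=10; return (destination, base, displacement)."""
    rex, opcode, modrm = b[0], b[1], b[2]
    assert (rex & 0xF8) == 0x48, "expected a REX.W prefix"
    assert opcode == 0x8B, "expected MOV r64, r/m64"
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    assert mod == 0b10 and rm != 0b100, "only [base + disp32], no SIB, is handled"
    dst = REG64[reg | (((rex >> 2) & 1) << 3)]   # REX.R extends ModRM.reg
    base = REG64[rm | ((rex & 1) << 3)]          # REX.B extends ModRM.rm
    disp = int.from_bytes(bytes(b[3:7]), "little", signed=True)
    return dst, base, disp


dst, base, disp = decode_mov_load(faulting_bytes(CODE))
addr = (REGS[base] + disp) & 0xFFFFFFFFFFFFFFFF  # effective address, 64-bit wrap
print(f"mov {disp:#x}(%{base}),%{dst} -> reads {addr:#x}")
print("matches CR2:", addr == CR2)
```

The same effective-address check works for any simple `[base + disp]` load: the register that is NULL (here RBX) plus the displacement gives the faulting address, which points at the pointer that should be examined in __udp_gso_segment.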