[report net/ipv6] neighbor table overflow causing CPU lockup followed by reset

From: l00450120
Date: Sat Jun 03 2023 - 09:55:01 EST


Hello,

My server has 96 CPUs. When a large number of packets are being sent and received, soft lockup warnings are reported on many of the CPUs, and the stack traces show them waiting on a spin_lock.

The following logs appear many times in my dmesg:
[ 563.176845] neighbour: ndisc_cache: neighbor table overflow!
[ 1861.114898] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
[ 1892.475051] watchdog: BUG: soft lockup - CPU#48 stuck for 21s! [ksoftirqd/48:255]
[ 1892.483796] Sample time: 4008751670 ns(HZ: 250)
[ 1892.483797] Sample stat:
[ 1892.483800] curr: user: 7437909680, nice: 0, sys: 49226747760, idle: 1792240181470, iowait: 135261640, irq: 1422047840, softirq: 41069616090, st: 0
[ 1892.483802] deta: user: 0, nice: 0, sys: 0, idle: 0, iowait: 0, irq: 2903860, softirq: 3997094320, st: 0
[ 1892.483803] Sample softirq:
[ 1892.483804] Sample irqstat:
[ 1892.483807] irq 14: delta 1001, curr: 473989, arch_timer
[ 1892.483839] irq 342: delta 1, curr: 929, enp129s0f0-TxRx-0
[ 1892.483896] CPU: 48 PID: 255 Comm: ksoftirqd/48 Kdump: loaded Tainted: G O 5.10.0-136.12.0.86
[ 1892.483898] Hardware name: Huawei S920X00/BC82AMDDA, BIOS 1.75 04/26/2021
[ 1892.483900] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
[ 1892.483908] pc : native_queued_spin_lock_slowpath+0x254/0x37c
[ 1892.483912] lr : fib6_run_gc+0x234/0x25c
[ 1892.483914] sp : ffff8001046cb990
[ 1892.483915] x29: ffff8001046cb990 x28: ffff800101cfd810
[ 1892.483917] x27: ffff800100cee1a8 x26: 000000010006129b
[ 1892.483919] x25: 000000000000007d x24: ffff800101c94fc0
[ 1892.483921] x23: ffff800101c95048 x22: ffff8001017cafa8
[ 1892.483924] x21: ffff800100cc857c x20: 000000000000025b
[ 1892.483926] x19: ffff800101c94700 x18: 0000000000000001
[ 1892.483928] x17: 0000000000000040 x16: ffffffffffffffff
[ 1892.483930] x15: 00003fffffffffff x14: 000000000000ffff
[ 1892.483932] x13: 00000000000003f0 x12: ffff203f7fad7770
[ 1892.483934] x11: 5e0e1ffeff348dae x10: 00000000000080fe
[ 1892.483936] x9 : ffff803e7e682000 x8 : 0000000000000000
[ 1892.483938] x7 : ffff203f7fae8040 x6 : ffff800101466040
[ 1892.483940] x5 : ffff203f7fae8040 x4 : 0000000000000000
[ 1892.483942] x3 : ffff800101c95048 x2 : 0000000000000000
[ 1892.483944] x1 : 0000000000c40000 x0 : ffff203f7fae8048
[ 1892.483949] Call trace:
[ 1892.483951] native_queued_spin_lock_slowpath+0x254/0x37c
[ 1892.483955] ip6_dst_gc+0xb8/0x14c
[ 1892.483960] dst_alloc+0xa4/0xe0
[ 1892.483962] ip6_dst_alloc+0x30/0xb0
[ 1892.483964] icmp6_dst_alloc+0x8c/0x21c
[ 1892.483966] mld_sendpack+0x178/0x374
[ 1892.483968] mld_send_cr+0x350/0x530
[ 1892.483970] mld_ifc_timer_expire+0x28/0x174
[ 1892.483972] call_timer_fn+0x3c/0x180
[ 1892.483973] expire_timers+0x150/0x1d0
[ 1892.483974] run_timer_softirq+0x134/0x380
[ 1892.483978] __do_softirq+0x130/0x358
[ 1892.483980] run_ksoftirqd+0x68/0x90
[ 1892.483983] smpboot_thread_fn+0x15c/0x1a0
[ 1892.483985] kthread+0x108/0x134
[ 1892.483987] ret_from_fork+0x10/0x18
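
For anyone comparing against their own system, the limits named in the first two messages can be read from procfs. This is only a minimal sketch: it assumes the standard sysctl paths under /proc/sys, and the helper name read_limits and its base parameter are illustrative, not part of any existing tool.

```python
from pathlib import Path

# Sysctl keys related to the neighbour-table and route-cache messages,
# given as paths relative to /proc/sys (assumed standard locations).
LIMIT_KEYS = [
    "net/ipv6/neigh/default/gc_thresh1",
    "net/ipv6/neigh/default/gc_thresh2",
    "net/ipv6/neigh/default/gc_thresh3",
    "net/ipv6/route/max_size",
    "net/ipv6/route/gc_thresh",
]


def read_limits(base="/proc/sys"):
    """Return {key: int}, with None for keys that are missing or unreadable."""
    limits = {}
    for key in LIMIT_KEYS:
        try:
            limits[key] = int((Path(base) / key).read_text().strip())
        except (OSError, ValueError):
            # Key not present on this kernel, not readable, or not a number.
            limits[key] = None
    return limits


if __name__ == "__main__":
    for key, value in read_limits().items():
        # Print in sysctl dotted notation, e.g. net.ipv6.route.max_size
        print(f"{key.replace('/', '.')} = {value}")
```

It only reads the values and changes nothing, so it should be safe to run on the affected machine.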

I also found that other people have run into almost the same situation:
https://forum.proxmox.com/threads/unexplained-neighbor-table-overflow-causing-cpu-lockup-followed-by-reset.121289/

So this does not look like an isolated case.

Could you give me some help?

Best regards,

LAC