Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

From: Aaron Lu
Date: Wed Jun 14 2023 - 00:37:03 EST


On Tue, Jun 13, 2023 at 10:32:03AM +0200, Peter Zijlstra wrote:
>
> Still gotta read it properly, however:
>
> On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
> > Single-socket | 72-core | 6-CCX | AMD Milan Zen3
> > Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c
>
> Could you please also benchmark on something Intel that has these stupid
> large LLCs ?
>
> Because the last time I tried something like this, it came apart real
> quick. And AMD has these relatively small 8-core LLCs.

I tested on an Intel(R) Xeon(R) Platinum 8358, which has 2 sockets and a
single LLC per socket spanning 32 cores / 64 threads.

When running netperf with nr_threads=128, runtime=60:

"
netserver -4

for i in `seq $nr_threads`; do
netperf -4 -H 127.0.0.1 -t UDP_RR -c -C -l $runtime &
done

wait
"

The contention on the per-LLC swqueue->lock is quite severe:

83.39% 83.33% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath - -
|
|--42.86%--__libc_sendto
| entry_SYSCALL_64
| do_syscall_64
| |
| --42.81%--__x64_sys_sendto
| __sys_sendto
| sock_sendmsg
| inet_sendmsg
| udp_sendmsg
| udp_send_skb
| ip_send_skb
| ip_output
| ip_finish_output
| __ip_finish_output
| ip_finish_output2
| __dev_queue_xmit
| __local_bh_enable_ip
| do_softirq.part.0
| __do_softirq
| |
| --42.81%--net_rx_action
| __napi_poll
| process_backlog
| __netif_receive_skb
| __netif_receive_skb_one_core
| ip_rcv
| ip_local_deliver
| ip_local_deliver_finish
| ip_protocol_deliver_rcu
| udp_rcv
| __udp4_lib_rcv
| udp_unicast_rcv_skb
| udp_queue_rcv_skb
| udp_queue_rcv_one_skb
| __udp_enqueue_schedule_skb
| sock_def_readable
| __wake_up_sync_key
| __wake_up_common_lock
| |
| --42.81%--__wake_up_common
| receiver_wake_function
| autoremove_wake_function
| default_wake_function
| try_to_wake_up
| ttwu_do_activate
| enqueue_task
| enqueue_task_fair
| _raw_spin_lock_irqsave
| |
| --42.81%--native_queued_spin_lock_slowpath
|
|--20.39%--0
| __libc_recvfrom
| entry_SYSCALL_64
| do_syscall_64
| __x64_sys_recvfrom
| __sys_recvfrom
| sock_recvmsg
| inet_recvmsg
| udp_recvmsg
| __skb_recv_udp
| __skb_wait_for_more_packets
| schedule_timeout
| schedule
| __schedule
| pick_next_task_fair
| |
| --20.39%--swqueue_remove_task
| _raw_spin_lock_irqsave
| |
| --20.39%--native_queued_spin_lock_slowpath
|
--20.07%--__libc_recvfrom
entry_SYSCALL_64
do_syscall_64
__x64_sys_recvfrom
__sys_recvfrom
sock_recvmsg
inet_recvmsg
udp_recvmsg
__skb_recv_udp
__skb_wait_for_more_packets
schedule_timeout
schedule
__schedule
|
--20.06%--pick_next_task_fair
swqueue_remove_task
_raw_spin_lock_irqsave
|
--20.06%--native_queued_spin_lock_slowpath

I suppose that is because there are many more CPUs sharing a single LLC on
this machine, and when all of them queue and pull tasks through the one
per-LLC shared wakequeue, the lock simply doesn't scale.
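For context on where that lock sits, here is a minimal sketch of the
per-LLC structure as I read the patch; apart from swqueue->lock and
swqueue_remove_task(), which show up in the profile above, the field and
function names below are illustrative rather than verbatim (e.g. the patch
adds some list node to task_struct, called swqueue_node here):

"
/*
 * Simplified sketch, not the actual patch code: one spinlock-protected
 * list shared by every CPU in the LLC.
 */
struct swqueue {
	struct list_head	list;	/* tasks queued by wakers in this LLC */
	spinlock_t		lock;	/* the lock seen in the profile */
} ____cacheline_aligned;

/* Wakeup path (via enqueue_task_fair()): every waker in the LLC takes the lock. */
static void swqueue_enqueue(struct swqueue *swqueue, struct task_struct *p)
{
	unsigned long flags;

	spin_lock_irqsave(&swqueue->lock, flags);
	list_add_tail(&p->swqueue_node, &swqueue->list);
	spin_unlock_irqrestore(&swqueue->lock, flags);
}

/* Pick path (via pick_next_task_fair()): every CPU going idle takes it again. */
static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
{
	struct task_struct *p;
	unsigned long flags;

	spin_lock_irqsave(&swqueue->lock, flags);
	p = list_first_entry_or_null(&swqueue->list, struct task_struct,
				     swqueue_node);
	if (p)
		list_del_init(&p->swqueue_node);
	spin_unlock_irqrestore(&swqueue->lock, flags);

	return p;
}
"

With 64 logical CPUs sharing one such queue here, both the sendto/enqueue
branch (~43%) and the two recvfrom/remove branches (~20% each) of the
profile end up serialized on that single lock, which is consistent with
the numbers above.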