Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

From: David Vernet
Date: Thu Jun 15 2023 - 19:27:34 EST


On Thu, Jun 15, 2023 at 03:31:53PM +0800, Aaron Lu wrote:
> On Thu, Jun 15, 2023 at 12:49:17PM +0800, Aaron Lu wrote:
> > I'll see if I can find a smaller machine and give it a run there too.
>
> Found a Skylake with 18cores/36threads on each socket/LLC and with
> netperf, the contention is still serious.
>
> "
> $ netserver
> $ sudo sh -c "echo SWQUEUE > /sys/kernel/debug/sched/features"
> $ for i in `seq 72`; do netperf -l 60 -n 72 -6 -t UDP_RR & done
> "
>
> 53.61% 53.61% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath - -
> |
> |--27.93%--sendto
> | entry_SYSCALL_64
> | do_syscall_64
> | |
> | --27.93%--__x64_sys_sendto
> | __sys_sendto
> | sock_sendmsg
> | inet6_sendmsg
> | udpv6_sendmsg
> | udp_v6_send_skb
> | ip6_send_skb
> | ip6_local_out
> | ip6_output
> | ip6_finish_output
> | ip6_finish_output2
> | __dev_queue_xmit
> | __local_bh_enable_ip
> | do_softirq.part.0
> | __do_softirq
> | net_rx_action
> | __napi_poll
> | process_backlog
> | __netif_receive_skb
> | __netif_receive_skb_one_core
> | ipv6_rcv
> | ip6_input
> | ip6_input_finish
> | ip6_protocol_deliver_rcu
> | udpv6_rcv
> | __udp6_lib_rcv
> | udp6_unicast_rcv_skb
> | udpv6_queue_rcv_skb
> | udpv6_queue_rcv_one_skb
> | __udp_enqueue_schedule_skb
> | sock_def_readable
> | __wake_up_sync_key
> | __wake_up_common_lock
> | |
> | --27.85%--__wake_up_common
> | receiver_wake_function
> | autoremove_wake_function
> | default_wake_function
> | try_to_wake_up
> | |
> | --27.85%--ttwu_do_activate
> | enqueue_task
> | enqueue_task_fair
> | |
> | --27.85%--_raw_spin_lock_irqsave
> | |
> | --27.85%--native_queued_spin_lock_slowpath
> |
> --25.67%--recvfrom
> entry_SYSCALL_64
> do_syscall_64
> __x64_sys_recvfrom
> __sys_recvfrom
> sock_recvmsg
> inet6_recvmsg
> udpv6_recvmsg
> __skb_recv_udp
> |
> --25.67%--__skb_wait_for_more_packets
> schedule_timeout
> schedule
> __schedule
> |
> --25.66%--pick_next_task_fair
> |
> --25.65%--swqueue_remove_task
> |
> --25.65%--_raw_spin_lock_irqsave
> |
> --25.65%--native_queued_spin_lock_slowpath
>
> I didn't aggregate the throughput(Trans. Rate per sec) from all these
> clients, but a glimpse from the result showed that the throughput of
> each client dropped from 4xxxx(NO_SWQUEUE) to 2xxxx(SWQUEUE).
>
> Thanks,
> Aaron

Ok, it seems that the issue is that I wasn't creating enough netperf
clients. I had assumed that -n $(nproc) was sufficient. With enough
clients, I was able to repro the contention on my 26 core / 52 thread
Skylake client as well:
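For anyone else chasing this, here is a sketch of the repro shape,
following Aaron's recipe quoted above; the -l/-t values are
illustrative, and the guard at the top is just so the script degrades
gracefully on boxes without netperf. The important part is launching
one long-running UDP_RR client per hardware thread, since -n on its
own only tells netperf the CPU count and doesn't add load:

```shell
#!/bin/sh
# Illustrative repro sketch (per Aaron's recipe above); values are examples.
command -v netperf >/dev/null 2>&1 || { echo "netperf not installed"; exit 0; }

nthreads=$(nproc)
netserver
sudo sh -c "echo SWQUEUE > /sys/kernel/debug/sched/features"
# One long-running UDP_RR client per hardware thread.
for i in $(seq "$nthreads"); do
	netperf -l 60 -n "$nthreads" -6 -t UDP_RR &
done
wait
```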


41.01% netperf [kernel.vmlinux] [k] queued_spin_lock_slowpath
|
--41.01%--queued_spin_lock_slowpath
|
--40.63%--_raw_spin_lock_irqsave
|
|--21.18%--enqueue_task_fair
| |
| --21.09%--default_wake_function
| |
| --21.09%--autoremove_wake_function
| |
| --21.09%--__wake_up_sync_key
| sock_def_readable
| __udp_enqueue_schedule_skb
| udpv6_queue_rcv_one_skb
| __udp6_lib_rcv
| ip6_input
| ipv6_rcv
| process_backlog
| net_rx_action
| |
| --21.09%--__softirqentry_text_start
| __local_bh_enable_ip
| ip6_output
| ip6_local_out
| ip6_send_skb
| udp_v6_send_skb
| udpv6_sendmsg
| __sys_sendto
| __x64_sys_sendto
| do_syscall_64
| entry_SYSCALL_64
|
--19.44%--swqueue_remove_task
|
--19.42%--pick_next_task_fair
|
--19.42%--schedule
|
--19.21%--schedule_timeout
__skb_wait_for_more_packets
__skb_recv_udp
udpv6_recvmsg
inet6_recvmsg
__x64_sys_recvfrom
do_syscall_64
entry_SYSCALL_64
40.87% netserver [kernel.vmlinux] [k] queued_spin_lock_slowpath
|
--40.87%--queued_spin_lock_slowpath
|
--40.51%--_raw_spin_lock_irqsave
|
|--21.03%--enqueue_task_fair
| |
| --20.94%--default_wake_function
| |
| --20.94%--autoremove_wake_function
| |
| --20.94%--__wake_up_sync_key
| sock_def_readable
| __udp_enqueue_schedule_skb
| udpv6_queue_rcv_one_skb
| __udp6_lib_rcv
| ip6_input
| ipv6_rcv
| process_backlog
| net_rx_action
| |
| --20.94%--__softirqentry_text_start
| __local_bh_enable_ip
| ip6_output
| ip6_local_out
| ip6_send_skb
| udp_v6_send_skb
| udpv6_sendmsg
| __sys_sendto
| __x64_sys_sendto
| do_syscall_64
| entry_SYSCALL_64
|
--19.48%--swqueue_remove_task
|
--19.47%--pick_next_task_fair
schedule
|
--19.38%--schedule_timeout
__skb_wait_for_more_packets
__skb_recv_udp
udpv6_recvmsg
inet6_recvmsg
__x64_sys_recvfrom
do_syscall_64
entry_SYSCALL_64

Thanks for the help in getting the repro on my end.

So yes, there is certainly a scalability concern to bear in mind with
swqueue for LLCs with a lot of cores: as the profiles show, every
enqueue and dequeue in the LLC serializes on the same per-LLC spinlock.
If you have a lot of tasks rapidly blocking and waking, e.g. on futexes
in a tight loop, I expect a similar issue would be observed.

On the other hand, the issue did not occur on my 7950X. I also wasn't
able to repro the contention on the Skylake if I ran with the default
netperf workload rather than UDP_RR (even with the additional clients).
I didn't bother to take the mean of all of the throughput results
between NO_SWQUEUE and SWQUEUE, but they looked roughly equal.

So swqueue isn't ideal for every configuration, but I'll echo my
sentiment from [0] that this on its own shouldn't necessarily preclude
it from being merged, given that it does help a large class of
configurations and workloads, and it's disabled by default.

[0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/

Thanks,
David