Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

From: Aaron Lu
Date: Thu Jun 15 2023 - 00:50:07 EST


Hi David,

On Wed, Jun 14, 2023 at 07:01:03PM -0500, David Vernet wrote:
> Hi Aaron,
>
> Thanks for taking a look and running some benchmarks. I tried to
> reproduce your results on a 26 core / 52 thread Cooperlake host, and a
> 20 core / 40 thread x 2 Skylake host, but I wasn't able to observe the
> contention on the per-swqueue spinlock you were saw on your Ice Lake.
>
> I ran the following netperf benchmarks:
>
> - netperf -l 60 -n $(nproc) -6
> - netperf -l 60 -n $(nproc) -6 -t UDP_RR

Just to confirm: did you run multiple of the above or just one?

>
> Here are the results I'm seeing from taking the average of running those
> benchmarks three times on each host (all results are from the "Throughput"
> column of the below table).
>
> Recv Send Send
> Socket Socket Message Elapsed
> Size Size Size Time Throughput <<<
> bytes bytes bytes secs. 10^6bits/sec
>
>
> 26 / 52 x 1 Cooperlake
> ----------------------
>
> Default workload:
>
> NO_SWQUEUE: 34103.76
> SWQUEUE: 34707.46
> Delta: +1.77%
>
> UDP_RR:
>
> NO_SWQUEUE: 57695.29
> SWQUEUE: 54827.36
> Delta: -4.97%
>
> There's clearly a statistically significant decline in performance for
> UDP_RR here, but surprisingly, I don't see any contention on swqueue
> locks when I profile with perf. Rather, we seem to be contending on the
> rq lock (presumably in swqueue_pick_next_task() which is unexpected (to
> me, at least):

I don't see rq lock contention though and since you have seen some
contention here, I suppose you should have started multiple instances of
netperf client?

>
> 7.81% netperf [kernel.vmlinux] [k] queued_spin_lock_slowpath
> |
> --7.81%--queued_spin_lock_slowpath
> |
> --7.55%--task_rq_lock
> pick_next_task_fair
> schedule
> schedule_timeout
> __skb_wait_for_more_packets
> __skb_recv_udp
> udpv6_recvmsg
> inet6_recvmsg
> __x64_sys_recvfrom
> do_syscall_64
> entry_SYSCALL_64
> __libc_recvfrom
> 0x7f923fdfacf0
> 0
> 4.97% netperf [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> |
> --4.96%--_raw_spin_lock_irqsave
> |
> |--1.13%--prepare_to_wait_exclusive
> | __skb_wait_for_more_packets
> | __skb_recv_udp
> | udpv6_recvmsg
> | inet6_recvmsg
> | __x64_sys_recvfrom
> | do_syscall_64
> | entry_SYSCALL_64
> | __libc_recvfrom
> | 0x7f923fdfacf0
> | 0
> |
> |--0.92%--default_wake_function
> | autoremove_wake_function
> | __wake_up_sync_key
> | sock_def_readable
> | __udp_enqueue_schedule_skb
> | udpv6_queue_rcv_one_skb
> | __udp6_lib_rcv
> | ip6_input
> | ipv6_rcv
> | process_backlog
> | net_rx_action
> | __softirqentry_text_start
> | __local_bh_enable_ip
> | ip6_output
> | ip6_local_out
> | ip6_send_skb
> | udp_v6_send_skb
> | udpv6_sendmsg
> | __sys_sendto
> | __x64_sys_sendto
> | do_syscall_64
> | entry_SYSCALL_64
> | __libc_sendto
> |
> |--0.81%--__wake_up_sync_key
> | sock_def_readable
> | __udp_enqueue_schedule_skb
> | udpv6_queue_rcv_one_skb
> | __udp6_lib_rcv
> | ip6_input
> | ipv6_rcv
> | process_backlog
> | net_rx_action
> | __softirqentry_text_start
> | __local_bh_enable_ip
> | ip6_output
> | ip6_local_out
> | ip6_send_skb
> | udp_v6_send_skb
> | udpv6_sendmsg
> | __sys_sendto
> | __x64_sys_sendto
> | do_syscall_64
> | entry_SYSCALL_64
> | __libc_sendto
> |
> |--0.73%--enqueue_task_fair
> | |
> | --0.72%--default_wake_function
> | autoremove_wake_function
> | __wake_up_sync_key
> | sock_def_readable
> | __udp_enqueue_schedule_skb
> | udpv6_queue_rcv_one_skb
> | __udp6_lib_rcv
> | ip6_input
> | ipv6_rcv
> | process_backlog
> | net_rx_action
> | __softirqentry_text_start
> | __local_bh_enable_ip
> | ip6_output
> | ip6_local_out
> | ip6_send_skb
> | udp_v6_send_skb
> | udpv6_sendmsg
> | __sys_sendto
> | __x64_sys_sendto
> | do_syscall_64
> | entry_SYSCALL_64
> | __libc_sendto
> |
> --0.68%--task_rq_lock
> pick_next_task_fair
> schedule
> schedule_timeout
> __skb_wait_for_more_packets
> __skb_recv_udp
> udpv6_recvmsg
> inet6_recvmsg
> __x64_sys_recvfrom
> do_syscall_64
> entry_SYSCALL_64
> __libc_recvfrom
> 0x7f923fdfacf0
> 0
>
> The profile without swqueue doesn't show any contention on the rq lock,
> but does show us spending a good amount of time in the scheduler:
>
> 4.03% netperf [kernel.vmlinux] [k] default_wake_function
> |
> --3.98%--default_wake_function
> autoremove_wake_function
> __wake_up_sync_key
> sock_def_readable
> __udp_enqueue_schedule_skb
> udpv6_queue_rcv_one_skb
> __udp6_lib_rcv
> ip6_input
> ipv6_rcv
> process_backlog
> net_rx_action
> __softirqentry_text_start
> __local_bh_enable_ip
> ip6_output
> ip6_local_out
> ip6_send_skb
> udp_v6_send_skb
> udpv6_sendmsg
> __sys_sendto
> __x64_sys_sendto
> do_syscall_64
> entry_SYSCALL_64
> __libc_sendto
> 3.70% netperf [kernel.vmlinux] [k] enqueue_entity
> |
> --3.66%--enqueue_entity
> enqueue_task_fair
> |
> --3.66%--default_wake_function
> autoremove_wake_function
> __wake_up_sync_key
> sock_def_readable
> __udp_enqueue_schedule_skb
> udpv6_queue_rcv_one_skb
> __udp6_lib_rcv
> ip6_input
> ipv6_rcv
> process_backlog
> net_rx_action
> __softirqentry_text_start
> __local_bh_enable_ip
> ip6_output
> ip6_local_out
> ip6_send_skb
> udp_v6_send_skb
> udpv6_sendmsg
> __sys_sendto
> __x64_sys_sendto
> do_syscall_64
> entry_SYSCALL_64
> __libc_sendto
>
> There's clearly a measurable impact on performance here due to swqueue
> (negative for UDP_RR, positive for the default workload), but this looks
> quite different than what you were observing.

Yes.

>
> 20 / 40 x 2 Skylake
> -------------------
>
> Default workload:
>
> NO_SWQUEUE: 57437.45
> SWQUEUE: 58801.11
> Delta: +2.37%
>
> UDP_RR:
>
> NO_SWQUEUE: 75932.28
> SWQUEUE: 75232.77
> Delta: -.92%
>
> Same observation here. I didn't collect a profile, but the trend seems
> consistent, and there's clearly a tradeoff. Despite the small drop in
> perf for UDP_RR, it seems quite a bit less drastic than what would be
> expected with the contention you showed in your profile.

Indeed.

>
> 7950X (8 cores / 16 threads per CCX, 2 CCXs)
> --------------------------------------------
>
> Default workload:
>
> NO_SWQUEUE: 77615.08
> SWQUEUE: 77465.73
> Delta: -.19%
>
> UDP_RR:
>
> NO_SWQUEUE: 230258.75
> SWQUEUE: 230280.34
> Delta: ~0%
>
> I'd call this essentially a no-op.
>

> With all that said, I have a few thoughts.
>
> Firstly, would you mind please sharing your .config? It's possible that
> the hosts I'm testing on just don't have big enough LLCs to observe the
> contention you're seeing on the swqueue spinlock, as my 26 / 52 CPL host
> is smaller than a single socket of your 32/64 ICL host. On the other
> hand, your ICL isn't _that_ much bigger per-LLC, so I'd be curious to

Agreed. Your 26cores/52threads isn't that smaller than mine and I had
expected to see something similar.

> see if there's a .config difference here that's causing the contention.

Attached my .config.
I also pushed the branch I used for testing to github just in case you
want to take a look: https://github.com/aaronlu/linux.git swqueue

> Also, the fact that Milan (which has only 6 cores / 12 threads per LLC)
> also saw a performance hit with swqueue for UDP_RR suggests to me that
> the issue with UDP_RR is not the scalability of the per-LLC swqueue
> spinlock.

I've tested again today and I still saw serious contention on
swqueue->lock with your cmdline. I did it this way:
"
$ netserver
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC

$ for i in `seq 128`; do netperf -l 60 -n 128 -6 -t UDP_RR & done
"
And the profile is about the same as my last posting:

83.23% 83.18% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath - -
|
|--40.23%--sendto
| entry_SYSCALL_64
| do_syscall_64
| |
| --40.22%--__x64_sys_sendto
| __sys_sendto
| sock_sendmsg
| inet6_sendmsg
| udpv6_sendmsg
| udp_v6_send_skb
| ip6_send_skb
| ip6_local_out
| ip6_output
| ip6_finish_output
| ip6_finish_output2
| __dev_queue_xmit
| __local_bh_enable_ip
| do_softirq.part.0
| __do_softirq
| net_rx_action
| __napi_poll
| process_backlog
| __netif_receive_skb
| __netif_receive_skb_one_core
| ipv6_rcv
| ip6_input
| ip6_input_finish
| ip6_protocol_deliver_rcu
| udpv6_rcv
| __udp6_lib_rcv
| udp6_unicast_rcv_skb
| udpv6_queue_rcv_skb
| udpv6_queue_rcv_one_skb
| __udp_enqueue_schedule_skb
| sock_def_readable
| __wake_up_sync_key
| __wake_up_common_lock
| |
| --40.22%--__wake_up_common
| receiver_wake_function
| autoremove_wake_function
| default_wake_function
| try_to_wake_up
| ttwu_do_activate
| enqueue_task
| enqueue_task_fair
| |
| --40.22%--_raw_spin_lock_irqsave
| |
| --40.21%--native_queued_spin_lock_slowpath
|
|--38.59%--recvfrom
| |
| --38.59%--entry_SYSCALL_64
| do_syscall_64
| __x64_sys_recvfrom
| __sys_recvfrom
| sock_recvmsg
| inet6_recvmsg
| udpv6_recvmsg
| __skb_recv_udp
| __skb_wait_for_more_packets
| schedule_timeout
| schedule
| __schedule
| |
| --38.59%--pick_next_task_fair
| |
| --38.59%--swqueue_remove_task
| |
| --38.59%--_raw_spin_lock_irqsave
| |
| --38.58%--native_queued_spin_lock_slowpath
|
|--2.25%--sendto
| entry_SYSCALL_64
| do_syscall_64
| |
| --2.25%--__x64_sys_sendto
| __sys_sendto
| sock_sendmsg
| inet6_sendmsg
| udpv6_sendmsg
| udp_v6_send_skb
| ip6_send_skb
| ip6_local_out
| ip6_output
| ip6_finish_output
| ip6_finish_output2
| __dev_queue_xmit
| __local_bh_enable_ip
| do_softirq.part.0
| __do_softirq
| net_rx_action
| __napi_poll
| process_backlog
| __netif_receive_skb
| __netif_receive_skb_one_core
| ipv6_rcv
| ip6_input
| ip6_input_finish
| ip6_protocol_deliver_rcu
| udpv6_rcv
| __udp6_lib_rcv
| udp6_unicast_rcv_skb
| udpv6_queue_rcv_skb
| udpv6_queue_rcv_one_skb
| __udp_enqueue_schedule_skb
| sock_def_readable
| __wake_up_sync_key
| __wake_up_common_lock
| |
| --2.25%--__wake_up_common
| receiver_wake_function
| autoremove_wake_function
| default_wake_function
| try_to_wake_up
| ttwu_do_activate
| enqueue_task
| enqueue_task_fair
| _raw_spin_lock_irqsave
| |
| --2.25%--native_queued_spin_lock_slowpath
|
--2.10%--recvfrom
entry_SYSCALL_64
do_syscall_64
__x64_sys_recvfrom
__sys_recvfrom
sock_recvmsg
inet6_recvmsg
udpv6_recvmsg
__skb_recv_udp
__skb_wait_for_more_packets
schedule_timeout
schedule
__schedule
|
--2.09%--pick_next_task_fair
swqueue_remove_task
_raw_spin_lock_irqsave
|
--2.09%--native_queued_spin_lock_slowpath

During the test, vmstat showed lines like this:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
185 0 0 128555784 47348 1791244 0 0 0 0 32387 4380484 1 98 0 0 0

When swqueue is disabled, vmstat showed lines like this and there is no
lock contention:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
116 0 0 128977136 58324 1338760 0 0 0 16 599677 9734395 5 72 23 0 0

The runnable tasks are a lot more when swqueue is in use and sys% also
increased, perhaps due to the lock contention.

I'll see if I can find a smaller machine and give it a run there too.

Thanks,
Aaron

Attachment: config-6.4.0-rc5-00237-g06b8769b15ae.gz
Description: application/gzip