Re: [RFC PATCH v4] sched: Fix performance regression introduced by mm_cid

From: Aaron Lu
Date: Tue Apr 11 2023 - 00:52:44 EST


On Mon, Apr 10, 2023 at 11:01:50AM -0400, Mathieu Desnoyers wrote:
> Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
> sysbench regression reported by Aaron Lu.
>
> Keep track of the currently allocated mm_cid for each mm/cpu rather than
> freeing them immediately on context switch. This eliminates most atomic
> operations when context switching back and forth between threads
> belonging to different memory spaces in multi-threaded scenarios (many
> processes, each with many threads). The per-mm/per-cpu mm_cid values are
> serialized by their respective runqueue locks.
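
To make the per-mm/per-cpu tracking above concrete: as I read it, the
data layout amounts to roughly the following sketch (field and type
names here are illustrative guesses, not necessarily the patch's):

struct mm_cid {
	int cid;	/* cid currently held for this cpu, or -1 if unset */
};

struct mm_struct {
	/* ... */
	struct mm_cid __percpu *pcpu_cid;	/* one entry per possible cpu,
						 * serialized by that cpu's
						 * runqueue lock */
	/* ... */
};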
>
> Thread migration is handled by introducing calls to
> sched_mm_cid_migrate_from() in set_task_cpu() and to
> sched_mm_cid_migrate_to() (with the destination runqueue lock held) in
> activate_task() for migrating tasks. set_task_cpu() is invoked with and
> without source rq lock held: the wakeup path does not hold the source rq
> lock.
>
> sched_mm_cid_migrate_from() clears the mm_cid from the task's mm per-cpu
> index corresponding to the source runqueue if it matches the last mm_cid
> observed by the migrated task. This last mm_cid value is returned as a
> hint to conditionally clear the mm's per-cpu mm_cid on the destination
> cpu.
>
> Then, in sched_mm_cid_migrate_to(), if the last mm_cid is smaller than
> the mm's current mm_cid on the destination cpu, clear the mm's mm_cid on
> the destination cpu. If the migrated task's mm is in use on the destination
> cpu, the reclaim of the mm_cid will be done lazily on the next
> destination cpu context switch, else it is performed immediately.
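
Putting the two hooks above together, a much-simplified sketch of the
intent as I understand it (signatures, the -1 "unset" value and helper
names are my assumptions; locking, memory ordering and the lazy-reclaim
bookkeeping are omitted):

int sched_mm_cid_migrate_from(struct task_struct *t, int src_cpu)
{
	struct mm_cid *src_cid = per_cpu_ptr(t->mm->pcpu_cid, src_cpu);
	int last_cid = t->last_mm_cid;

	/* Only clear the source cpu's slot if it still holds the cid
	 * last observed by the migrated task. */
	if (src_cid->cid == last_cid)
		src_cid->cid = -1;

	return last_cid;		/* hint for the destination side */
}

void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t,
			     int last_cid)
{
	struct mm_cid *dst_cid = per_cpu_ptr(t->mm->pcpu_cid,
					     cpu_of(dst_rq));

	/* Converge towards small cid values: drop the destination cpu's
	 * cid when the incoming hint is smaller.  Reclaim of the dropped
	 * cid happens lazily on the next context switch if the mm is in
	 * use on the destination cpu, immediately otherwise. */
	if (last_cid < dst_cid->cid)
		dst_cid->cid = -1;
}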
>
> The source cpu's mm_cid is _not_ simply moved to the destination cpu on
> migration, because passing ownership of the mm_cid value to the
> destination cpu while an actively running task also has its own
> mm_cid value (in case of lazy reclaim on next context switch) would
> over-allocate mm_cid values beyond the number of possible cpus.
>
> Because we want the mm_cid values to converge towards the smaller
> values as migrations happen, the prior optimization that was done when
> context switching between threads belonging to the same mm is removed,
> as it could delay the lazy release of the destination runqueue
> mm_cid after it has been replaced by a migration. Removing this prior
> optimization is not an issue performance-wise, because the introduced
> per-mm/per-cpu mm_cid tracking also covers this more specific case.
>
> This patch is based on v6.3-rc6 with this patch applied:
>
> ("mm: Fix memory leak on mm_init error handling")
>
> https://lore.kernel.org/lkml/20230330133822.66271-1-mathieu.desnoyers@xxxxxxxxxxxx/

Running the previously mentioned postgres_sysbench workload with this
patch applied showed single-digit lock contention from cid_lock,
ranging from 1.x% to 7.x% during a 3-minute run. This is worse than v1,
which I tested before and which showed almost no lock contention:
https://lore.kernel.org/lkml/20230404015949.GA14939@ziqianlu-desk2/
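
If I read the v4 code right, the contention comes from mm_cid_get()
taking the mm-wide cid_lock to allocate a cid from the cidmask. As a
rough sketch (based on my reading of the pre-existing mm_cid code in
v6.3, so an approximation rather than the exact v4 code):

static inline int mm_cid_get(struct mm_struct *mm)
{
	struct cpumask *cpumask = mm_cidmask(mm);
	int cid;

	raw_spin_lock(&mm->cid_lock);	/* the contended lock */
	cid = cpumask_first_zero(cpumask);
	if (cid >= nr_cpu_ids)
		cid = -1;
	else
		__cpumask_set_cpu(cid, cpumask);
	raw_spin_unlock(&mm->cid_lock);
	return cid;
}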

Detailed lock contention numbers for the 3-minute run are:

$ grep "\[k\] native_queued" *profile
perf_0.profile: 5.44% 5.44% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
perf_1.profile: 7.49% 7.49% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
perf_2.profile: 6.65% 6.65% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
perf_3.profile: 3.38% 3.38% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
perf_4.profile: 3.01% 3.01% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
perf_5.profile: 1.74% 1.74% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath

Cycles were recorded roughly every 30 seconds, for a 5-second window each time, on all CPUs.
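
For reference, each profile was captured with something along these
lines (illustrative; the exact options in my script may differ):

# one 5s sample window, repeated roughly every 30s during the run
perf record -a -g -e cycles -o perf_N.data -- sleep 5
perf report -i perf_N.data --stdio > perf_N.profile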

For the worst profile, perf_1.profile, the call traces for the
contention are:

7.49% 7.49% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath - -
5.46% native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule_idle;do_idle;cpu_startup_entry;start_secondary;secondary_startup_64_no_verify
1.22% native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule;schedule_hrtimeout_range_clock;schedule_hrtimeout_range;do_epoll_wait;__x64_sys_epoll_wait;do_syscall_64;entry_SYSCALL_64;0x7fdcfd755d16
0.47% native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule;exit_to_user_mode_prepare;syscall_exit_to_user_mode;do_syscall_64;entry_SYSCALL_64;0x7fdcfdf9044c
0.11% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;try_to_wake_up;default_wake_function;ep_autoremove_wake_function;__wake_up_common;__wake_up_common_lock;__wake_up;ep_poll_callback;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;0x7f2edc54e494

I then also tested v3 and v2, and it turns out lock contention is even
worse on those two versions: 5.x% to 22.x% for v3, and 6% to 22.x% for
v2. It feels to me like something changed in v2 that brought back lock
contention, and then some optimization done in v4 made things better,
though still not as good as v1.

Thanks,
Aaron