Re: [RFC PATCH v4] sched: Fix performance regression introduced by mm_cid

From: Aaron Lu
Date: Tue Apr 11 2023 - 09:12:38 EST


On Tue, Apr 11, 2023 at 12:52:25PM +0800, Aaron Lu wrote:
> On Mon, Apr 10, 2023 at 11:01:50AM -0400, Mathieu Desnoyers wrote:
> > Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
> > sysbench regression reported by Aaron Lu.
> >
> > Keep track of the currently allocated mm_cid for each mm/cpu rather than
> > freeing them immediately on context switch. This eliminates most atomic
> > operations when context switching back and forth between threads
> > belonging to different memory spaces in multi-threaded scenarios (many
> > processes, each with many threads). The per-mm/per-cpu mm_cid values are
> > serialized by their respective runqueue locks.
> >
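
(Side note, just to check my reading: a minimal sketch of the per-mm/per-cpu
tracking as I understand it. The type and field names below are illustrative
only, not necessarily what the patch actually uses.)

/*
 * Illustrative only -- conceptually, each mm keeps one cid slot per
 * possible cpu plus a bitmap of allocated cids; a slot is only read or
 * written with the corresponding cpu's runqueue lock held, so no extra
 * atomics are needed on the context switch fast path.
 */
struct mm_cid_state {
        int *pcpu_cid;                  /* per cpu: cid owned by that cpu, -1 if none */
        unsigned long *cid_bitmap;      /* allocated cids, at most nr-possible-cpus bits */
};
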
> > Thread migration is handled by introducing an invocation of
> > sched_mm_cid_migrate_from() in set_task_cpu() and of
> > sched_mm_cid_migrate_to() (with the destination runqueue lock held) in
> > activate_task() for migrating tasks. set_task_cpu() is invoked with and
> > without source rq lock held: the wakeup path does not hold the source rq
> > lock.
> >
> > sched_mm_cid_migrate_from() clears the mm_cid from the task's mm per-cpu
> > index corresponding to the source runqueue if it matches the last mm_cid
> > observed by the migrated task. This last mm_cid value is returned as a
> > hint to conditionally clear the mm's per-cpu mm_cid on the destination
> > cpu.
> >
> > Then, in sched_mm_cid_migrate_to(), if the last mm_cid is smaller than
> > the mm's destination cpu current mm_cid, clear the mm's destination cpu
> > current mm_cid. If the migrated task's mm is in use on the destination
> > cpu, the reclaim of the mm_cid will be done lazily on the next
> > destination cpu context switch, else it is performed immediately.
> >
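
(Again to check my understanding, a simplified sketch of the hand-off
described in the two paragraphs above, reusing the illustrative
mm_cid_state type from my earlier note. The real functions take rq and
task_struct pointers and handle locking and the lazy-reclaim case, all of
which is omitted here.)

static int sched_mm_cid_migrate_from(struct mm_cid_state *mm, int src_cpu,
                                     int last_cid)
{
        /* Clear the source slot only if it still holds the cid this task last used. */
        if (mm->pcpu_cid[src_cpu] == last_cid)
                mm->pcpu_cid[src_cpu] = -1;
        return last_cid;        /* hint passed to the destination side */
}

static void sched_mm_cid_migrate_to(struct mm_cid_state *mm, int dst_cpu,
                                    int src_cid)
{
        int dst_cid = mm->pcpu_cid[dst_cpu];

        /* Clear the destination slot only if the incoming cid is smaller. */
        if (src_cid < 0 || dst_cid < 0 || src_cid >= dst_cid)
                return;
        mm->pcpu_cid[dst_cpu] = -1;
        /*
         * If the mm is in use on dst_cpu, the cid it holds is reclaimed
         * lazily on that cpu's next context switch; otherwise it can be
         * reclaimed immediately.
         */
}
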
> > The source cpu's mm_cid is _not_ simply moved to the destination cpu on
> > migration, because passing ownership of the mm_cid value to the
> > destination cpu while an actively running task also has its own
> > mm_cid value (in case of lazy reclaim on next context switch) would
> > over-allocate mm_cid values beyond the number of possible cpus.
> >
> > Because we want to ensure the mm_cid converges towards the smaller
> > values as migrations happen, the prior optimization that was done when
> > context switching between threads belonging to the same mm is removed,
> > because it could delay the lazy release of the destination runqueue
> > mm_cid after it has been replaced by a migration. Removing this prior
> > optimization is not an issue performance-wise because the introduced
> > per-mm/per-cpu mm_cid tracking also covers this more specific case.
> >
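
(For reference, the optimization being removed is, roughly from memory of
the pre-patch switch_mm_cid() in kernel/sched/core.c, the same-mm hand-over
below; the exact code may differ slightly.)

static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
{
        if (prev->mm_cid_active) {
                if (next->mm_cid_active && next->mm == prev->mm) {
                        /* Same mm: hand the cid over directly, no put/get. */
                        next->mm_cid = prev->mm_cid;
                        prev->mm_cid = -1;
                        return;
                }
                mm_cid_put(prev->mm, prev->mm_cid);
                prev->mm_cid = -1;
        }
        if (next->mm_cid_active)
                next->mm_cid = mm_cid_get(next->mm);
}
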
> > This patch is based on v6.3-rc6 with this patch applied:
> >
> > ("mm: Fix memory leak on mm_init error handling")
> >
> > https://lore.kernel.org/lkml/20230330133822.66271-1-mathieu.desnoyers@xxxxxxxxxxxx/
>
> Running the previously mentioned postgres_sysbench workload with this
> patch applied showed single-digit lock contention from cid_lock,
> ranging from 1.x% - 7.x% during a 3-minute run. This is worse than v1,
> which I tested before and which showed almost no lock contention:
> https://lore.kernel.org/lkml/20230404015949.GA14939@ziqianlu-desk2/
>
> Detailed lock contention numbers for the 3-minute run are:
>
> $ grep "\[k\] native_queued" *profile
> perf_0.profile: 5.44% 5.44% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> perf_1.profile: 7.49% 7.49% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> perf_2.profile: 6.65% 6.65% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> perf_3.profile: 3.38% 3.38% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> perf_4.profile: 3.01% 3.01% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> perf_5.profile: 1.74% 1.74% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>
> cycles are recorded roughly every 30 seconds for a 5s window on all CPUs.
>
> And for the worst profile perf_1.profile, the call traces for the
> contention are:
>
> 7.49% 7.49% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath - -
> 5.46% native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule_idle;do_idle;cpu_startup_entry;start_secondary;secondary_startup_64_no_verify
> 1.22% native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule;schedule_hrtimeout_range_clock;schedule_hrtimeout_range;do_epoll_wait;__x64_sys_epoll_wait;do_syscall_64;entry_SYSCALL_64;0x7fdcfd755d16
> 0.47% native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule;exit_to_user_mode_prepare;syscall_exit_to_user_mode;do_syscall_64;entry_SYSCALL_64;0x7fdcfdf9044c
> 0.11% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;try_to_wake_up;default_wake_function;ep_autoremove_wake_function;__wake_up_common;__wake_up_common_lock;__wake_up;ep_poll_callback;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;0x7f2edc54e494
>
> I then also tested v3 and v2; it turns out lock contention is even worse
> on those two versions: v3: 5.x% - 22.x%, v2: 6% - 22.x%. It feels to me
> like something changed in v2 that brought back lock contention, and some
> optimization done in v4 then made things better, though still not as good
> as v1.

Please disregard the "v4 is better than v2 and v3" part above; my later
test showed that the contention can also rise to around 18% for v4.

As for why v2-v4 are worse on lock contention than v1, I think that's
because v1 invoked sched_mm_cid_migrate_from/to() in the
move_queued_task() path, which didn't affect tasks migrated at wakeup
time, while v2-v4 invoke sched_mm_cid_migrate_from() in set_task_cpu(),
which does affect tasks migrated at wakeup time. And for this workload,
tasks migrate a lot at wakeup time: during a 5s window there are about
5 million wakeup-time migrations, while move_queued_task() accounts for
only a few hundred or thousand migrations in the same window.
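
To illustrate the difference in hook placement, here is abbreviated
pseudo-code of the call paths as I understand them; the "..." elisions are
mine and the call counts come from the 5s window numbers above:

/* v1: hook only on the queued-task migration path, which is rare here. */
static struct rq *move_queued_task(/* rq, rf, p, new_cpu */ ...)
{
        /* ... */
        sched_mm_cid_migrate_from(...); /* a few hundred/thousand calls per 5s */
        set_task_cpu(p, new_cpu);
        /* ... */
}

/*
 * v2-v4: hook inside set_task_cpu(), which the wakeup path also reaches
 * via try_to_wake_up() -> select_task_rq() -> set_task_cpu().
 */
void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
        /* ... */
        sched_mm_cid_migrate_from(p);   /* ~5 million wakeup migrations per 5s */
        /* ... */
}
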