Re: [RFC PATCH v4] sched: Fix performance regression introduced by mm_cid

From: michael . christie
Date: Mon Apr 10 2023 - 17:51:58 EST


On 4/10/23 10:01 AM, Mathieu Desnoyers wrote:
> Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
> sysbench regression reported by Aaron Lu.
>
> Keep track of the currently allocated mm_cid for each mm/cpu rather than
> freeing them immediately on context switch. This eliminates most atomic
> operations when context switching back and forth between threads
> belonging to different memory spaces in multi-threaded scenarios (many
> processes, each with many threads). The per-mm/per-cpu mm_cid values are
> serialized by their respective runqueue locks.
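
For anyone skimming the thread, here is a rough userspace sketch of the caching idea described above. It is not the kernel code: `struct mm_sketch`, `mm_cid_alloc()`, `mm_cid_get()`, `NR_CPUS` and `MM_CID_UNSET` are hypothetical names made up for illustration, a plain bool array stands in for the mm's cid bitmap, and there is no locking (the patch relies on the runqueue locks for that). It only shows that once a cid is cached per mm/cpu, switching back and forth between two address spaces on one cpu no longer hits the allocation path.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4            /* hypothetical CPU count for this sketch */
#define MM_CID_UNSET (-1)    /* hypothetical "no cid cached" marker */

/* Hypothetical stand-in for the per-mm state; not the kernel's mm_struct. */
struct mm_sketch {
    bool cid_used[NR_CPUS];  /* stand-in for the mm's cid bitmap */
    int  pcpu_cid[NR_CPUS];  /* cid cached for this mm on each CPU */
    int  nr_allocs;          /* counts trips through the slow path */
};

static void mm_sketch_init(struct mm_sketch *mm)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        mm->cid_used[cpu] = false;
        mm->pcpu_cid[cpu] = MM_CID_UNSET;
    }
    mm->nr_allocs = 0;
}

/* Slow path: take the lowest free cid (atomic bitmap ops in a real kernel). */
static int mm_cid_alloc(struct mm_sketch *mm)
{
    for (int cid = 0; cid < NR_CPUS; cid++) {
        if (!mm->cid_used[cid]) {
            mm->cid_used[cid] = true;
            mm->nr_allocs++;
            return cid;
        }
    }
    return MM_CID_UNSET;
}

/* Switch-in: reuse the cid cached for this mm/cpu, allocate only if unset. */
static int mm_cid_get(struct mm_sketch *mm, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    if (mm->pcpu_cid[cpu] == MM_CID_UNSET)
        mm->pcpu_cid[cpu] = mm_cid_alloc(mm);
    return mm->pcpu_cid[cpu];
}
```

Calling `mm_cid_get()` alternately for two mms on the same cpu allocates once per mm and then only reads the cached value, which is the "eliminates most atomic operations" part as I understand it.
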
>
> Thread migration is handled by introducing invocation to
> sched_mm_cid_migrate_from() in set_task_cpu() and to
> sched_mm_cid_migrate_to() (with destination runqueue lock held) in
> activate_task() for migrating tasks. set_task_cpu() is invoked with and
> without source rq lock held: the wakeup path does not hold the source rq
> lock.
>
> sched_mm_cid_migrate_from() clears the mm_cid from the task's mm per-cpu
> index corresponding to the source runqueue if it matches the last mm_cid
> observed by the migrated task. This last mm_cid value is returned as a
> hint to conditionally clear the mm's per-cpu mm_cid on the destination
> cpu.
>
> Then, in sched_mm_cid_migrate_to(), if the last mm_cid is smaller than
> the mm's destination cpu current mm_cid, clear the mm's destination cpu
> current mm_cid. If the migrated task's mm is in use on the destination
> cpu, the reclaim of the mm_cid will be done lazily on the next
> destination cpu context switch, else it is performed immediately.
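
The two migration steps above can be sketched in userspace roughly as follows. This is only an illustration under stated assumptions, not the patch itself: `struct mm_sketch`, `migrate_from()`, `migrate_to()`, `in_use` and `lazy_put` are hypothetical names, the hint is assumed to always be the task's last mm_cid, and the cid bitmap and rq locking are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4            /* hypothetical CPU count for this sketch */
#define MM_CID_UNSET (-1)    /* hypothetical "no cid" marker */

/* Hypothetical per-mm state; the real patch keeps this in per-cpu data. */
struct mm_sketch {
    int  pcpu_cid[NR_CPUS];  /* cid currently held for this mm on each CPU */
    bool in_use[NR_CPUS];    /* is a thread of this mm running on that CPU? */
    bool lazy_put[NR_CPUS];  /* release the cid at the next context switch */
};

/*
 * Source side: clear the source CPU's cid if it matches the last cid the
 * migrating task observed. Returns that last cid as a hint (assumption).
 */
static int migrate_from(struct mm_sketch *mm, int src_cpu, int last_cid)
{
    assert(src_cpu >= 0 && src_cpu < NR_CPUS);
    if (last_cid != MM_CID_UNSET && mm->pcpu_cid[src_cpu] == last_cid)
        mm->pcpu_cid[src_cpu] = MM_CID_UNSET;
    return last_cid;
}

/*
 * Destination side: if the hint is smaller than the destination CPU's
 * current cid, give that cid up so the mm converges towards small values.
 * If the mm is running there, only mark it for lazy release.
 */
static void migrate_to(struct mm_sketch *mm, int dst_cpu, int last_cid)
{
    assert(dst_cpu >= 0 && dst_cpu < NR_CPUS);
    int dst_cid = mm->pcpu_cid[dst_cpu];

    if (last_cid == MM_CID_UNSET || dst_cid == MM_CID_UNSET)
        return;
    if (last_cid >= dst_cid)
        return;                               /* destination cid is already smaller */
    if (mm->in_use[dst_cpu])
        mm->lazy_put[dst_cpu] = true;         /* reclaimed on next context switch */
    else
        mm->pcpu_cid[dst_cpu] = MM_CID_UNSET; /* reclaimed immediately */
}
```

A migration is then `migrate_to(mm, dst, migrate_from(mm, src, last_cid))`. When the task next runs on the destination cpu it would pick up a fresh (smaller) cid through the normal allocation path, which this sketch does not model.
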
>
> The source cpu's mm_cid is _not_ simply moved to the destination cpu on
> migration, because passing ownership of the mm_cid value to the
> destination cpu while an actively running task also has its own
> mm_cid value (in case of lazy reclaim on next context switch) would
> over-allocate mm_cid values beyond the number of possible cpus.
>
> Because we want to ensure the mm_cid converges towards the smaller
> values as migrations happen, the prior optimization that was done when
> context switching between threads belonging to the same mm is removed,
> because it could delay the lazy release of the destination runqueue
> mm_cid after it has been replaced by a migration. Removing this prior
> optimization is not an issue performance-wise because the introduced
> per-mm/per-cpu mm_cid tracking also covers this more specific case.
>
> This patch is based on v6.3-rc6 with this patch applied:
>
> ("mm: Fix memory leak on mm_init error handling")
>
> https://lore.kernel.org/lkml/20230330133822.66271-1-mathieu.desnoyers@xxxxxxxxxxxx/
>
> Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
> Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> Cc: Aaron Lu <aaron.lu@xxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Olivier Dion <odion@xxxxxxxxxxxx>
> Cc: michael.christie@xxxxxxxxxx

Hey, thanks for fixing this.

When testing linux-next with vhost devices, without this patch throughput gets stuck at
around 1.3 million IOPS total when using 8 or more devices per VM (you get a worker
thread per device). With this patch applied, IOPS scale again, and we get up to 2.4M IOPS
when using up to 16 devices per VM.

Tested-by: Mike Christie <michael.christie@xxxxxxxxxx>