Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

From: Peter Zijlstra
Date: Tue Jun 20 2023 - 05:12:44 EST


On Tue, Jun 20, 2023 at 01:44:32PM +0530, Swapnil Sapkal wrote:
> Hello Mathieu,
>
> On 4/22/2023 1:13 PM, tip-bot2 for Mathieu Desnoyers wrote:
> > The following commit has been merged into the sched/core branch of tip:
> >
> > Commit-ID: 223baf9d17f25e2608dbdff7232c095c1e612268
> > Gitweb: https://git.kernel.org/tip/223baf9d17f25e2608dbdff7232c095c1e612268
> > Author: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> > AuthorDate: Thu, 20 Apr 2023 10:55:48 -04:00
> > Committer: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > CommitterDate: Fri, 21 Apr 2023 13:24:20 +02:00
> >
> > sched: Fix performance regression introduced by mm_cid
> >
> > Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
> > sysbench regression reported by Aaron Lu.
> >
> > Keep track of the currently allocated mm_cid for each mm/cpu rather than
> > freeing them immediately on context switch. This eliminates most atomic
> > operations when context switching back and forth between threads
> > belonging to different memory spaces in multi-threaded scenarios (many
> > processes, each with many threads). The per-mm/per-cpu mm_cid values are
> > serialized by their respective runqueue locks.
> >
> > Thread migration is handled by introducing invocation to
> > sched_mm_cid_migrate_to() (with destination runqueue lock held) in
> > activate_task() for migrating tasks. If the destination cpu's mm_cid is
> > unset, and if the source runqueue is not actively using its mm_cid, then
> > the source cpu's mm_cid is moved to the destination cpu on migration.
> >
> > Introduce a task-work executed periodically, similarly to NUMA work,
> > which delays reclaim of cid values when they are unused for a period of
> > time.
> >
> > Keep track of the allocation time for each per-cpu cid, and let the task
> > work clear them when they are observed to be older than
> > SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all
> > mm_cids which are greater or equal to the Hamming weight of the mm
> > cidmask to keep concurrency ids compact.
> >
> > Because we want to ensure the mm_cid converges towards the smaller
> > values as migrations happen, the prior optimization that was done when
> > context switching between threads belonging to the same mm is removed,
> > because it could delay the lazy release of the destination runqueue
> > mm_cid after it has been replaced by a migration. Removing this prior
> > optimization is not an issue performance-wise because the introduced
> > per-mm/per-cpu mm_cid tracking also covers this more specific case.
> >
> > Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
> > Reported-by: Aaron Lu <aaron.lu@xxxxxxxxx>
> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> > Tested-by: Aaron Lu <aaron.lu@xxxxxxxxx>
> > Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
>
> I run standard benchmarks as a part of kernel performance regression
> testing. When I run these benchmarks against v6.3.0 to v6.4-rc1,
> I have seen performance regression in hackbench running with threads. When I did
> git bisect it pointed to this commit and reverting this commit helps regains
> the performance. This regression is not seen with hackbench processes.

Well, *this* commit was supposed to help fix the horrible contention on
cid_lock that was introduced with af7f588d8f73.

> Following are the results from 1 Socket 4th generation EPYC
> Processor(1 X 96C/192T) configured in NPS1 mode. This regression
> becomes more severe as the number of core count increases.
>
> The numbers on a 1 Socket Bergamo (1 X 128 cores/256 threads) is significantly worse.
>
> Threads:
>
> Test: With-mmcid-patch Without-mmcid-patch
> 1-groups: 5.23 (0.00 pct) 4.61 (+11.85 pct)
> 2-groups: 4.99 (0.00 pct) 4.72 (+5.41 pct)
> 4-groups: 5.96 (0.00 pct) 4.87 (+18.28 pct)
> 8-groups: 6.58 (0.00 pct) 5.44 (+17.32 pct)
> 16-groups: 11.48 (0.00 pct) 8.07 (+29.70 pct)

I'm really confused, so you're saying that having a process wide
spinlock is better than what this patch does? Or are you testing against
something without mm-cid entirely?