Re: [PATCH] mm: Sync percpu mm RSS counters before querying

From: Andrew Morton
Date: Fri Jun 16 2023 - 15:31:52 EST


On Fri, 16 Jun 2023 20:07:18 +0200 Michal Koutný <mkoutny@xxxxxxxx> wrote:

> An issue was observed with stats collected in struct rusage on ppc64le
> with 64kB pages. The percpu counters use batching with
> percpu_counter_batch = max(32, nr*2) # in PAGE_SIZE
> i.e. with larger pages but similar RSS consumption (bytes), there'll be
> less flushes and error more noticeable.

A fully detailed description of the issue would be helpful. Obviously
"inaccuracy", but how bad?

> In this given case (getting consumption of exited child), we can request
> percpu counter's flush without worrying about contention with updaters.
>
> Fortunately, the commit f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter") didn't eradicate all traces of SPLIT_RSS_COUNTING and
> this mechanism already provided some synchronization points before
> reading stats.
> Therefore, use sync_mm_rss as carrier for percpu counters refreshes and
> forget SPLIT_RSS_COUNTING macro for good.
>
> Impact of summing on a 8 CPU machine:
> Benchmark 1: taskset -c 1 ./shell-bench.sh
>
> Before
> Time (mean ± σ): 9.950 s ± 0.052 s [User: 7.773 s, System: 2.023 s]
>
> After
> Time (mean ± σ): 9.990 s ± 0.070 s [User: 7.825 s, System: 2.011 s]
>
> cat >shell-bench.sh <<EOD
> for (( i = 0; i < 20000; i++ )); do
> /bin/true
> done
> EOD
>
> The script is meant to stress fork-exit path (exit is where sync_mm_rss
> is most called, add_mm_rss_vec should be covered in fork).
>
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2547,13 +2547,12 @@ static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
> *maxrss = hiwater_rss;
> }
>
> -#if defined(SPLIT_RSS_COUNTING)
> -void sync_mm_rss(struct mm_struct *mm);
> -#else
> static inline void sync_mm_rss(struct mm_struct *mm)
> {
> + for (int i = 0; i < NR_MM_COUNTERS; ++i)
> + percpu_counter_set(&mm->rss_stat[i],
> + percpu_counter_sum(&mm->rss_stat[i]));
> }
> -#endif

Far too large to be inlined! For six callsites it adds 1kb of text.

Why even modify the counter? Can't <whatever this issue is> be
addressed by using percpu_counter_sum() in an appropriate place?

For unknown reasons percpu_counter_set() uses for_each_possible_cpu().
Probably just a mistake - percpu_counters are hotplug-aware and
for_each_online_cpu should suffice.

I'm really not liking percpu_counter_set(). It's only safe in
situations where the caller knows that no other CPU can be modifying
the counter. I wonder if all the callers know that. This situation
isn't aided by the lack of any documentation.