Re: [PATCH 0/2] execve scalability issues, part 1

From: Jan Kara
Date: Wed Aug 23 2023 - 05:52:49 EST


On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> On 8/22/23, Jan Kara <jack@xxxxxxx> wrote:
> > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> >> On 8/21/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> >> > True Fix(tm) is a longer story.
> >> >
> >> > Maybe let's sort out this patchset first, whichever way. :)
> >> >
> >>
> >> So I found the discussion around the original patch with a perf
> >> regression report.
> >>
> >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
> >>
> >> The reporter suggests dodging the problem by only allocating per-cpu
> >> counters when the process is going multithreaded. Given that there is
> >> still plenty of forever single-threaded procs out there I think that's
> >> does sound like a great plan regardless of what happens with this
> >> patchset.
> >>
> >> Almost all access is already done using dedicated routines, so this
> >> should be an afternoon churn to sort out, unless I missed a
> >> showstopper. (maybe there is no good place to stuff a flag/whatever
> >> other indicator about the state of counters?)
> >>
> >> That said I'll look into it some time this or next week.
> >
> > Good, just let me know how it went, I also wanted to start looking into
> > this to come up with some concrete patches :). What I had in mind was that
> > we could use 'counters == NULL' as an indication that the counter is still
> > in 'single counter mode'.
> >
>
> In the current state there are only pointers to counters in mm_struct
> and there is no storage for them in task_struct. So I don't think
> merely null-checking the per-cpu stuff is going to cut it -- where
> should the single-threaded counters land?

I think you misunderstood. What I wanted to do it to provide a new flavor
of percpu_counter (sharing most of code and definitions) which would have
an option to start as simple counter (indicated by pcc->counters == NULL
and using pcc->count for counting) and then be upgraded by a call to real
percpu thing. Because I think such counters would be useful also on other
occasions than as rss counters.

> Bonus problem, non-current can modify these counters and this needs to
> be safe against current playing with them at the same time. (and it
> would be a shame to require current to use atomic on them)

Hum, I didn't realize that. Indeed I can see that e.g. khugepaged can be
modifying the counters for other processes. Thanks for pointing this out.

> That said, my initial proposal adds a union:
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5e74ce4a28cd..ea70f0c08286 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -737,7 +737,11 @@ struct mm_struct {
>
> unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for
> /proc/PID/auxv */
>
> - struct percpu_counter rss_stat[NR_MM_COUNTERS];
> + union {
> + struct percpu_counter rss_stat[NR_MM_COUNTERS];
> + u64 *rss_stat_single;
> + };
> + bool magic_flag_stuffed_elsewhere;
>
> struct linux_binfmt *binfmt;
>
>
> Then for single-threaded case an area is allocated for NR_MM_COUNTERS
> countes * 2 -- first set updated without any synchro by current
> thread. Second set only to be modified by others and protected with
> mm->arg_lock. The lock protects remote access to the union to begin
> with.

arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
with two counters is clever but I'm not 100% convinced the complexity is
really worth it. I'm not sure the overhead of always using an atomic
counter would really be measurable as atomic counter ops in local CPU cache
tend to be cheap. Did you try to measure the difference?

If the second counter proves to be worth it, we could make just that one
atomic to avoid the need for abusing some spinlock.

> Transition to per-CPU operation sets the magic flag (there is plenty
> of spare space in mm_struct, I'll find a good home for it without
> growing the struct). It would be a one-way street -- a process which
> gets a bunch of threads and goes back to one stays with per-CPU.

Agreed with switching to be a one-way street.

> Then you get the true value of something by adding both counters.
>
> arg_lock is sparingly used, so remote ops are not expected to contend
> with anything. In fact their cost is going to go down since percpu
> summation takes a spinlock which also disables interrupts.
>
> Local ops should be about the same in cost as they are right now.
>
> I might have missed some detail in the above description, but I think
> the approach is decent.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR