Re: [PATCH 0/2] execve scalability issues, part 1

From: Dennis Zhou
Date: Wed Aug 23 2023 - 16:28:41 EST


On Wed, Aug 23, 2023 at 11:49:15AM +0200, Jan Kara wrote:
> On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> > On 8/22/23, Jan Kara <jack@xxxxxxx> wrote:
> > > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> > >> On 8/21/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > >> > True Fix(tm) is a longer story.
> > >> >
> > >> > Maybe let's sort out this patchset first, whichever way. :)
> > >> >
> > >>
> > >> So I found the discussion around the original patch with a perf
> > >> regression report.
> > >>
> > >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
> > >>
> > >> The reporter suggests dodging the problem by only allocating per-cpu
> > >> counters when the process is going multithreaded. Given that there is
> > >> still plenty of forever single-threaded procs out there I think that's
> > >> does sound like a great plan regardless of what happens with this
> > >> patchset.
> > >>
> > >> Almost all access is already done using dedicated routines, so this
> > >> should be an afternoon churn to sort out, unless I missed a
> > >> showstopper. (maybe there is no good place to stuff a flag/whatever
> > >> other indicator about the state of counters?)
> > >>
> > >> That said I'll look into it some time this or next week.
> > >
> > > Good, just let me know how it went, I also wanted to start looking into
> > > this to come up with some concrete patches :). What I had in mind was that
> > > we could use 'counters == NULL' as an indication that the counter is still
> > > in 'single counter mode'.
> > >
> >
> > In the current state there are only pointers to counters in mm_struct
> > and there is no storage for them in task_struct. So I don't think
> > merely null-checking the per-cpu stuff is going to cut it -- where
> > should the single-threaded counters land?
>
> I think you misunderstood. What I wanted to do it to provide a new flavor
> of percpu_counter (sharing most of code and definitions) which would have
> an option to start as simple counter (indicated by pcc->counters == NULL
> and using pcc->count for counting) and then be upgraded by a call to real
> percpu thing. Because I think such counters would be useful also on other
> occasions than as rss counters.
>

Kent wrote something similar and sent it out last year [1]. However, the
case slightly differs from what we'd want here, 1 -> 2 threads becomes
percpu vs update rate which a single thread might be able to trigger?

[1] https://lore.kernel.org/lkml/20230501165450.15352-8-surenb@xxxxxxxxxx/

Thanks,
Dennis

> > Bonus problem, non-current can modify these counters and this needs to
> > be safe against current playing with them at the same time. (and it
> > would be a shame to require current to use atomic on them)
>
> Hum, I didn't realize that. Indeed I can see that e.g. khugepaged can be
> modifying the counters for other processes. Thanks for pointing this out.
>
> > That said, my initial proposal adds a union:
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 5e74ce4a28cd..ea70f0c08286 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -737,7 +737,11 @@ struct mm_struct {
> >
> > unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for
> > /proc/PID/auxv */
> >
> > - struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > + union {
> > + struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > + u64 *rss_stat_single;
> > + };
> > + bool magic_flag_stuffed_elsewhere;
> >
> > struct linux_binfmt *binfmt;
> >
> >
> > Then for single-threaded case an area is allocated for NR_MM_COUNTERS
> > countes * 2 -- first set updated without any synchro by current
> > thread. Second set only to be modified by others and protected with
> > mm->arg_lock. The lock protects remote access to the union to begin
> > with.
>
> arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> with two counters is clever but I'm not 100% convinced the complexity is
> really worth it. I'm not sure the overhead of always using an atomic
> counter would really be measurable as atomic counter ops in local CPU cache
> tend to be cheap. Did you try to measure the difference?
>
> If the second counter proves to be worth it, we could make just that one
> atomic to avoid the need for abusing some spinlock.
>
> > Transition to per-CPU operation sets the magic flag (there is plenty
> > of spare space in mm_struct, I'll find a good home for it without
> > growing the struct). It would be a one-way street -- a process which
> > gets a bunch of threads and goes back to one stays with per-CPU.
>
> Agreed with switching to be a one-way street.
>
> > Then you get the true value of something by adding both counters.
> >
> > arg_lock is sparingly used, so remote ops are not expected to contend
> > with anything. In fact their cost is going to go down since percpu
> > summation takes a spinlock which also disables interrupts.
> >
> > Local ops should be about the same in cost as they are right now.
> >
> > I might have missed some detail in the above description, but I think
> > the approach is decent.
>
> Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR