Re: [PATCH 0/2] execve scalability issues, part 1

From: Jan Kara
Date: Thu Aug 24 2023 - 05:21:16 EST

Next message: XueBing Chen: "[PATCH] drm/amdgpu/sdma: Clean up errors in sdma_v3_0.c"
Previous message: David Wronek: "[PATCH v5 4/4] arm64: dts: qcom: Add support for the Xiaomi SM7125 platform"
Next in thread: Mateusz Guzik: "Re: [PATCH 0/2] execve scalability issues, part 1"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed 23-08-23 13:27:56, Dennis Zhou wrote:
> On Wed, Aug 23, 2023 at 11:49:15AM +0200, Jan Kara wrote:
> > On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> > > On 8/22/23, Jan Kara <jack@xxxxxxx> wrote:
> > > > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> > > >> On 8/21/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > > >> > True Fix(tm) is a longer story.
> > > >> >
> > > >> > Maybe let's sort out this patchset first, whichever way. :)
> > > >> >
> > > >>
> > > >> So I found the discussion around the original patch with a perf
> > > >> regression report.
> > > >>
> > > >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
> > > >>
> > > >> The reporter suggests dodging the problem by only allocating per-cpu
> > > >> counters when the process is going multithreaded. Given that there is
> > > >> still plenty of forever single-threaded procs out there I think that's
> > > >> does sound like a great plan regardless of what happens with this
> > > >> patchset.
> > > >>
> > > >> Almost all access is already done using dedicated routines, so this
> > > >> should be an afternoon churn to sort out, unless I missed a
> > > >> showstopper. (maybe there is no good place to stuff a flag/whatever
> > > >> other indicator about the state of counters?)
> > > >>
> > > >> That said I'll look into it some time this or next week.
> > > >
> > > > Good, just let me know how it went, I also wanted to start looking into
> > > > this to come up with some concrete patches :). What I had in mind was that
> > > > we could use 'counters == NULL' as an indication that the counter is still
> > > > in 'single counter mode'.
> > > >
> > >
> > > In the current state there are only pointers to counters in mm_struct
> > > and there is no storage for them in task_struct. So I don't think
> > > merely null-checking the per-cpu stuff is going to cut it -- where
> > > should the single-threaded counters land?
> >
> > I think you misunderstood. What I wanted to do it to provide a new flavor
> > of percpu_counter (sharing most of code and definitions) which would have
> > an option to start as simple counter (indicated by pcc->counters == NULL
> > and using pcc->count for counting) and then be upgraded by a call to real
> > percpu thing. Because I think such counters would be useful also on other
> > occasions than as rss counters.
> >
>
> Kent wrote something similar and sent it out last year [1]. However, the
> case slightly differs from what we'd want here, 1 -> 2 threads becomes
> percpu vs update rate which a single thread might be able to trigger?

Thanks for the pointer but that version of counters is not really suitable
here as is (but we could factor out some common bits if that work is
happening). 1 thread can easily do 10000 RSS updates per second.

> [1] https://lore.kernel.org/lkml/20230501165450.15352-8-surenb@xxxxxxxxxx/

Honza

> > > Bonus problem, non-current can modify these counters and this needs to
> > > be safe against current playing with them at the same time. (and it
> > > would be a shame to require current to use atomic on them)
> >
> > Hum, I didn't realize that. Indeed I can see that e.g. khugepaged can be
> > modifying the counters for other processes. Thanks for pointing this out.
> >
> > > That said, my initial proposal adds a union:
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 5e74ce4a28cd..ea70f0c08286 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -737,7 +737,11 @@ struct mm_struct {
> > >
> > > unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for
> > > /proc/PID/auxv */
> > >
> > > - struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > > + union {
> > > + struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > > + u64 *rss_stat_single;
> > > + };
> > > + bool magic_flag_stuffed_elsewhere;
> > >
> > > struct linux_binfmt *binfmt;
> > >
> > >
> > > Then for single-threaded case an area is allocated for NR_MM_COUNTERS
> > > countes * 2 -- first set updated without any synchro by current
> > > thread. Second set only to be modified by others and protected with
> > > mm->arg_lock. The lock protects remote access to the union to begin
> > > with.
> >
> > arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> > with two counters is clever but I'm not 100% convinced the complexity is
> > really worth it. I'm not sure the overhead of always using an atomic
> > counter would really be measurable as atomic counter ops in local CPU cache
> > tend to be cheap. Did you try to measure the difference?
> >
> > If the second counter proves to be worth it, we could make just that one
> > atomic to avoid the need for abusing some spinlock.
> >
> > > Transition to per-CPU operation sets the magic flag (there is plenty
> > > of spare space in mm_struct, I'll find a good home for it without
> > > growing the struct). It would be a one-way street -- a process which
> > > gets a bunch of threads and goes back to one stays with per-CPU.
> >
> > Agreed with switching to be a one-way street.
> >
> > > Then you get the true value of something by adding both counters.
> > >
> > > arg_lock is sparingly used, so remote ops are not expected to contend
> > > with anything. In fact their cost is going to go down since percpu
> > > summation takes a spinlock which also disables interrupts.
> > >
> > > Local ops should be about the same in cost as they are right now.
> > >
> > > I might have missed some detail in the above description, but I think
> > > the approach is decent.
> >
> > Honza
> > --
> > Jan Kara <jack@xxxxxxxx>
> > SUSE Labs, CR
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR

Next message: XueBing Chen: "[PATCH] drm/amdgpu/sdma: Clean up errors in sdma_v3_0.c"
Previous message: David Wronek: "[PATCH v5 4/4] arm64: dts: qcom: Add support for the Xiaomi SM7125 platform"
Next in thread: Mateusz Guzik: "Re: [PATCH 0/2] execve scalability issues, part 1"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]