Re: [PATCH 0/2] execve scalability issues, part 1

From: Jan Kara
Date: Wed Aug 23 2023 - 11:47:33 EST


On Wed 23-08-23 14:13:20, Mateusz Guzik wrote:
> On 8/23/23, Jan Kara <jack@xxxxxxx> wrote:
> > On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> >> On 8/22/23, Jan Kara <jack@xxxxxxx> wrote:
> >> Then for the single-threaded case an area is allocated for
> >> NR_MM_COUNTERS counters * 2 -- the first set is updated without any
> >> synchronization by the current thread. The second set is only to be
> >> modified by others and is protected with mm->arg_lock. The lock
> >> protects remote access to the union to begin with.
> >
> > arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> > with two counters is clever but I'm not 100% convinced the complexity is
> > really worth it. I'm not sure the overhead of always using an atomic
> > counter would really be measurable as atomic counter ops in local CPU cache
> > tend to be cheap. Did you try to measure the difference?
> >
>
> arg_lock would not be used as is; it would have to be renamed to
> something more generic.

Ah, OK.

> Atomics on x86-64 are very expensive to this very day. Here is a
> sample measurement, done by someone else, in which 2 atomics show up:
> https://lore.kernel.org/oe-lkp/202308141149.d38fdf91-oliver.sang@xxxxxxxxx/T/#u
>
> tl;dr it is *really* bad.

I didn't express myself well. Sure, atomics are expensive compared to plain
arithmetic operations. But what I wanted to say is that we had atomics for
RSS counters before commit f1a7941243 ("mm: convert mm's rss stats into
percpu_counter") and people seemed happy with it until there were many CPUs
contending on the updates. So maybe RSS counters aren't used heavily enough
for the difference to practically matter? Probably an operation like faulting
in (or unmapping) a tmpfs file has the highest chance of showing the cost of
rss accounting compared to the cost of the remainder of the operation...
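As a rough starting point for such a measurement, here is a minimal
userspace sketch that times a plain increment loop against a relaxed atomic
increment loop on an uncontended counter. It is only an illustration: the
helper names (bump_plain, bump_atomic, time_bump) are made up for this
sketch, it uses C11 atomics rather than the kernel's atomic_long_t, and it
says nothing about the cost relative to a real page fault or unmap path.

```c
#include <stdatomic.h>
#include <time.h>

/* Plain read-modify-write; volatile keeps the loop from being folded away. */
long bump_plain(long n)
{
	volatile long c = 0;

	for (long i = 0; i < n; i++)
		c = c + 1;
	return c;
}

/* Same loop with an atomic add; the cache line stays local (no contention). */
long bump_atomic(long n)
{
	atomic_long c;

	atomic_init(&c, 0);
	for (long i = 0; i < n; i++)
		atomic_fetch_add_explicit(&c, 1, memory_order_relaxed);
	return atomic_load(&c);
}

/* CPU time in seconds spent running fn(n). */
double time_bump(long (*fn)(long), long n)
{
	clock_t t0 = clock();

	fn(n);
	return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Comparing time_bump(bump_plain, n) with time_bump(bump_atomic, n) for a
large n gives a per-op cost difference on a given machine; whether that
difference is visible next to the rest of a fault is the open question.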

> > If the second counter proves to be worth it, we could make just that one
> > atomic to avoid the need for abusing some spinlock.
>
> The spinlock would be there to synchronize against the transition to
> per-cpu -- any trickery is avoided and we trivially know for a fact
> the remote party sees either the per-cpu state if transitioned, or
> the local state if not. Then one easily knows no updates have been
> lost and the buffer for the 2 sets of counters can be safely freed.

Yeah, the spinlock makes the transition simpler, I agree.
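To make the scheme under discussion concrete, here is a minimal userspace
sketch, not kernel code. Everything in it is an assumption made for
illustration: struct rss_sketch and the rss_* helpers are hypothetical
names, a C11 atomic_flag spin loop stands in for the (renamed) arg_lock,
NR_COUNTERS stands in for NR_MM_COUNTERS, and a plain array of atomics
stands in for the per-cpu counters. It also assumes the owner thread is
the one performing the transition.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_COUNTERS 4	/* hypothetical stand-in for NR_MM_COUNTERS */

struct rss_sketch {
	atomic_flag lock;	/* stand-in for the renamed arg_lock */
	bool shared_mode;	/* changed only by the owner, under lock */
	long *buf;		/* [0..N): owner set, [N..2N): remote set */
	atomic_long *shared;	/* stand-in for the per-cpu counters */
};

static inline void sk_lock(struct rss_sketch *r)
{
	while (atomic_flag_test_and_set_explicit(&r->lock, memory_order_acquire))
		;	/* spin */
}

static inline void sk_unlock(struct rss_sketch *r)
{
	atomic_flag_clear_explicit(&r->lock, memory_order_release);
}

int rss_init(struct rss_sketch *r)
{
	atomic_flag_clear(&r->lock);
	r->shared_mode = false;
	r->shared = NULL;
	r->buf = calloc(2 * NR_COUNTERS, sizeof(*r->buf));
	return r->buf ? 0 : -1;
}

/* Owner thread only: plain add, no lock, while still single-threaded. */
void rss_add_local(struct rss_sketch *r, int idx, long v)
{
	if (!r->shared_mode)
		r->buf[idx] += v;
	else
		atomic_fetch_add(&r->shared[idx], v);
}

/* Any other thread: the lock decides which state it gets to see. */
void rss_add_remote(struct rss_sketch *r, int idx, long v)
{
	bool shared;

	sk_lock(r);
	shared = r->shared_mode;
	if (!shared)
		r->buf[NR_COUNTERS + idx] += v;
	sk_unlock(r);
	if (shared)
		atomic_fetch_add(&r->shared[idx], v);
}

/* Owner thread only: fold both sets into the shared counters. */
int rss_upgrade(struct rss_sketch *r)
{
	atomic_long *s = calloc(NR_COUNTERS, sizeof(*s));

	if (!s)
		return -1;
	assert(!r->shared_mode);
	sk_lock(r);
	for (int i = 0; i < NR_COUNTERS; i++)
		atomic_init(&s[i], r->buf[i] + r->buf[NR_COUNTERS + i]);
	r->shared = s;
	r->shared_mode = true;
	sk_unlock(r);
	free(r->buf);	/* no remote updater can still be touching it */
	r->buf = NULL;
	return 0;
}

/* Owner thread only in this sketch. */
long rss_read(struct rss_sketch *r, int idx)
{
	long v;

	sk_lock(r);
	v = r->shared_mode ? atomic_load(&r->shared[idx])
			   : r->buf[idx] + r->buf[NR_COUNTERS + idx];
	sk_unlock(r);
	return v;
}

void rss_destroy(struct rss_sketch *r)
{
	free(r->buf);
	free(r->shared);
}
```

The point of the lock is visible in rss_add_remote() and rss_upgrade(): a
remote updater either adds to the second set before the fold or sees
shared_mode afterwards, so no update is lost and the two-set buffer can be
freed right after the unlock.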

> While writing down the idea previously I did not realize the per-cpu
> counter ops disable interrupts around the op. That's already very slow
> and the trip should be comparable to paying for an atomic (as in the
> patch which introduced percpu counters here slowed things down for
> single-threaded processes).
>
> With your proposal the atomic would be there, but interrupt trip could
> be avoided. This would roughly maintain the current cost of doing the
> op (as in it would not get /worse/). My patch would make it lower.
>
> All that said, I'm going to refrain from writing a patch for the time
> being. If the powers that be decide on your approach, I'm not going to
> argue -- I don't think either is a clear winner over the other.

I guess we'll need to code it and compare the results :)

Honza

--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR