Re: Expensive memory.stat + cpu.stat reads

From: Yosry Ahmed
Date: Fri Aug 11 2023 - 20:15:55 EST

Next message: Sohil Mehta: "Re: [PATCH 1/1] x86/pti: Fix kernel warnings for pti= and nopti cmdline options."
Previous message: kernel test robot: "arch/x86/platform/uv/bios_uv.c:175:12: sparse: sparse: function 'uv_bios_get_master_nasid' with external linkage has definition"
In reply to: Yosry Ahmed: "Re: Expensive memory.stat + cpu.stat reads"
Next in thread: Tejun Heo: "Re: Expensive memory.stat + cpu.stat reads"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, Aug 11, 2023 at 4:43 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Fri, Aug 11, 2023 at 3:03 PM Ivan Babrou <ivan@xxxxxxxxxxxxxx> wrote:
> >
> > On Fri, Jul 14, 2023 at 5:30 PM Ivan Babrou <ivan@xxxxxxxxxxxxxx> wrote:
> > >
> > > On Thu, Jul 13, 2023 at 4:25 PM Ivan Babrou <ivan@xxxxxxxxxxxxxx> wrote:
> > > > > My understanding of mem-stat and cpu-stat is that they are independent
> > > > > of each other. In theory, reading one shouldn't affect the performance
> > > > > of reading the others. Since you are doing mem-stat and cpu-stat reading
> > > > > repetitively in a loop, it is likely that all the data are in the cache
> > > > > most of the time resulting in very fast processing time. If it happens
> > > > > that the specific memory location of mem-stat and cpu-stat data are such
> > > > > that reading one will cause the other data to be flushed out of the
> > > > > cache and have to be re-read from memory again, you could see
> > > > > significant performance regression.
> > > > >
> > > > > It is one of the possible causes, but I may be wrong.
> > > >
> > > > Do you think it's somewhat similar to how iterating a matrix in rows
> > > > is faster than in columns due to sequential vs random memory reads?
> > > >
> > > > * https://stackoverflow.com/q/9936132
> > > > * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> > > > * https://en.wikipedia.org/wiki/Loop_interchange
> > > >
> > > > I've had a similar suspicion and it would be good to confirm whether
> > > > it's that or something else. I can probably collect perf counters for
> > > > different runs, but I'm not sure which ones I'll need.
> > > >
> > > > In a similar vein, if we could come up with a tracepoint that would
> > > > tell us the amount of work done (or any other relevant metric that
> > > > would help) during rstat flushing, I can certainly collect that
> > > > information as well for every reading combination.
> > >
> > > Since cgroup_rstat_flush_locked appears in flamegraphs for both fast
> > > (discrete) and slow (combined) cases, I grabbed some stats for it:
> > >
> > > * Slow:
> > >
> > > completed: 19.43s [manual / mem-stat + cpu-stat]
> > >
> > > $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> > > Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> > > ^C
> > > 00:12:55
> > > usecs : count distribution
> > > 0 -> 1 : 0 | |
> > > 2 -> 3 : 0 | |
> > > 4 -> 7 : 0 | |
> > > 8 -> 15 : 0 | |
> > > 16 -> 31 : 0 | |
> > > 32 -> 63 : 0 | |
> > > 64 -> 127 : 1 | |
> > > 128 -> 255 : 191 |************ |
> > > 256 -> 511 : 590 |****************************************|
> > > 512 -> 1023 : 186 |************ |
> > > 1024 -> 2047 : 2 | |
> > > 2048 -> 4095 : 0 | |
> > > 4096 -> 8191 : 0 | |
> > > 8192 -> 16383 : 504 |********************************** |
> > > 16384 -> 32767 : 514 |********************************** |
> > > 32768 -> 65535 : 3 | |
> > > 65536 -> 131071 : 1 | |
> > >
> > > avg = 8852 usecs, total: 17633268 usecs, count: 1992
> > >
> > > * Fast:
> > >
> > > completed: 0.95s [manual / mem-stat]
> > > completed: 0.05s [manual / cpu-stat]
> > >
> > > $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> > > Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> > > ^C
> > > 00:13:27
> > > usecs : count distribution
> > > 0 -> 1 : 0 | |
> > > 2 -> 3 : 0 | |
> > > 4 -> 7 : 499 |****************************************|
> > > 8 -> 15 : 253 |******************** |
> > > 16 -> 31 : 191 |*************** |
> > > 32 -> 63 : 41 |*** |
> > > 64 -> 127 : 12 | |
> > > 128 -> 255 : 2 | |
> > > 256 -> 511 : 2 | |
> > > 512 -> 1023 : 0 | |
> > > 1024 -> 2047 : 0 | |
> > > 2048 -> 4095 : 0 | |
> > > 4096 -> 8191 : 0 | |
> > > 8192 -> 16383 : 34 |** |
> > > 16384 -> 32767 : 21 |* |
> > >
> > > avg = 857 usecs, total: 904762 usecs, count: 1055
> > >
> > > There's a different number of calls into cgroup_rstat_flush_locked and
> > > they are much slower in the slow case. There are also two bands in the
> > > slow case, with 8ms..32ms having the half of the calls.
> > >
> > > For mem_cgroup_css_rstat_flush:
> > >
> > > * Slow:
> > >
> > > completed: 32.77s [manual / mem-stat + cpu-stat]
> > >
> > > $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> > > Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> > > ^C
> > > 00:21:25
> > > usecs : count distribution
> > > 0 -> 1 : 93078 |* |
> > > 2 -> 3 : 3397714 |****************************************|
> > > 4 -> 7 : 1009440 |*********** |
> > > 8 -> 15 : 168013 |* |
> > > 16 -> 31 : 93 | |
> > >
> > > avg = 3 usecs, total: 17189289 usecs, count: 4668338
> > >
> > > * Fast:
> > >
> > > completed: 0.16s [manual / mem-stat]
> > > completed: 0.04s [manual / cpu-stat]
> > >
> > > $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> > > Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> > > ^C
> > > 00:21:57
> > > usecs : count distribution
> > > 0 -> 1 : 1441 |*** |
> > > 2 -> 3 : 18780 |****************************************|
> > > 4 -> 7 : 4826 |********** |
> > > 8 -> 15 : 732 |* |
> > > 16 -> 31 : 1 | |
> > >
> > > avg = 3 usecs, total: 89174 usecs, count: 25780
> > >
> > > There's an 181x difference in the number of calls into
> > > mem_cgroup_css_rstat_flush.
> > >
> > > Does this provide a clue? Perhaps cgroup_rstat_cpu_pop_updated is
> > > yielding a ton more iterations for some reason here?
> > >
> > > * https://elixir.bootlin.com/linux/v6.1/source/kernel/cgroup/rstat.c#L196
> > >
> > > It's inlined, but I can place a probe into the loop:
> > >
> > > 7 for_each_possible_cpu(cpu) {
> > > 8 raw_spinlock_t *cpu_lock =
> > > per_cpu_ptr(&cgroup_rstat_cpu_lock,
> > > cpu);
> > > 10 struct cgroup *pos = NULL;
> > > unsigned long flags;
> > >
> > > /*
> > > * The _irqsave() is needed because cgroup_rstat_lock is
> > > * spinlock_t which is a sleeping lock on
> > > PREEMPT_RT. Acquiring
> > > * this lock with the _irq() suffix only
> > > disables interrupts on
> > > * a non-PREEMPT_RT kernel. The raw_spinlock_t
> > > below disables
> > > * interrupts on both configurations. The
> > > _irqsave() ensures
> > > * that interrupts are always disabled and
> > > later restored.
> > > */
> > > raw_spin_lock_irqsave(cpu_lock, flags);
> > > while ((pos =
> > > cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
> > > struct cgroup_subsys_state *css;
> > >
> > > cgroup_base_stat_flush(pos, cpu);
> > > 26 bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
> > >
> > > 28 rcu_read_lock();
> > > 29 list_for_each_entry_rcu(css,
> > > &pos->rstat_css_list,
> > > rstat_css_node)
> > > 31 css->ss->css_rstat_flush(css, cpu);
> > > 32 rcu_read_unlock();
> > > }
> > > 34 raw_spin_unlock_irqrestore(cpu_lock, flags);
> > >
> > > I added probes on both line 26 and line 31 to catch the middle and inner loops.
> > >
> > > * Slow:
> > >
> > > completed: 32.97s [manual / mem-stat + cpu-stat]
> > >
> > > Performance counter stats for '/tmp/derp':
> > >
> > > 4,702,570 probe:cgroup_rstat_flush_locked_L26
> > > 9,301,436 probe:cgroup_rstat_flush_locked_L31
> > >
> > > * Fast:
> > >
> > > completed: 0.17s [manual / mem-stat]
> > > completed: 0.34s [manual / cpu-stat]
> > >
> > > Performance counter stats for '/tmp/derp':
> > >
> > > 31,769 probe:cgroup_rstat_flush_locked_L26
> > > 62,849 probe:cgroup_rstat_flush_locked_L31
> > >
> > > It definitely looks like cgroup_rstat_cpu_pop_updated is yielding a
> > > lot more positions.
> > >
> > > I'm going to sign off for the week, but let me know if I should place
> > > any more probes to nail this down.
> >
> > I spent some time looking into this and I think I landed on a fix:
> >
> > * https://github.com/bobrik/linux/commit/50b627811d54
> >
> > I'm not 100% sure if it's the right fix for the issue, but it reduces
> > the runtime significantly.
>
> Flushing the entire hierarchy in mem_cgroup_flush_stats() was added
> such that concurrent flushers can just skip and let one flusher do the
> work for everyone. This was added because we flush the stats in some
> paths (like reclaim, refault, dirty throttling) where sometimes there
> is a lot of concurrency and we have a thundering herd problem on the
> cgroup rstat global lock.
>
> Maybe we can separate userspace reads from other flushers, such that
> userspace reads flush the cgroup in question only, while in-kernel
> flushers skip if someone else is flushing.
>
> There is also some inconsistency today as not all paths use
> mem_cgroup_flush_stats() (see zswap charging function in
> mm/memcontrol.c).
>
> Separating userspace reads from in-kernel flushers would also help
> because skipping a flush if someone else is flushing for userspace
> reads can lead to inaccuracy (see [1]).
>
> I would wait for Shakeel to weigh in here, since he introduced the
> unified flushing.
>
> [1]https://lore.kernel.org/lkml/20230809045810.1659356-1-yosryahmed@xxxxxxxxxx/
>

+Tejun Heo

There have been a lot of problems coming from this global rstat lock:
hard lockups (when we used to flush atomically), unified flushing
being expensive, skipping flushing being inaccurate, etc.

I wonder if it's time to rethink this lock and break it down into
granular locks. Perhaps a per-cgroup lock, and develop a locking
scheme where you always lock a parent then a child, then flush the
child and unlock it and move to the next child, etc. This will allow
concurrent flushing of non-root cgroups. Even when flushing the root,
if we flush all its children first without locking the root, then only
lock the root when flushing the top-level children, then some level of
concurrency can be achieved.

Maybe this is too complicated, I never tried to implement it, but I
have been bouncing around this idea in my head for a while now.

We can also split the update tree per controller. As far as I can tell
there is no reason to flush cpu stats for example when someone wants
to read memory stats.

Next message: Sohil Mehta: "Re: [PATCH 1/1] x86/pti: Fix kernel warnings for pti= and nopti cmdline options."
Previous message: kernel test robot: "arch/x86/platform/uv/bios_uv.c:175:12: sparse: sparse: function 'uv_bios_get_master_nasid' with external linkage has definition"
In reply to: Yosry Ahmed: "Re: Expensive memory.stat + cpu.stat reads"
Next in thread: Tejun Heo: "Re: Expensive memory.stat + cpu.stat reads"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]