Re: [RFC PATCH 1/2] psi: introduce memory.pressure.stat

From: CGEL
Date: Tue Aug 16 2022 - 22:59:57 EST


On Wed, Aug 03, 2022 at 09:55:39AM -0400, Johannes Weiner wrote:
> On Mon, Aug 01, 2022 at 12:42:04AM +0000, cgel.zte@xxxxxxxxx wrote:
> > From: cgel <cgel@xxxxxxxxxx>
> >
> > For now, psi memory pressure accounts for all memory stalls in the
> > system and does not provide detailed information about why the stalls
> > happen. This patch introduces a cgroup knob, memory.pressure.stat,
> > which reports detailed stall information for all memory events, along
> > with its format and the corresponding proc interface.
> >
> > For the cgroup, add memory.pressure.stat, which shows:
> > kswapd: avg10=0.00 avg60=0.00 avg300=0.00 total=0
> > direct reclaim: avg10=0.00 avg60=0.00 avg300=0.12 total=42356
> > kcompacted: avg10=0.00 avg60=0.00 avg300=0.00 total=0
> > direct compact: avg10=0.00 avg60=0.00 avg300=0.00 total=0
> > cgroup reclaim: avg10=0.00 avg60=0.00 avg300=0.00 total=0
> > workingset thrashing: avg10=0.00 avg60=0.00 avg300=0.00 total=0
> >
> > System-wide, a proc file is introduced as pressure/memory_stat,
> > and its format is the same as the cgroup interface.
> >
> > With this detailed information, for example, if the system is stalled
> > because of kcompacted, compaction_proactiveness can be raised so
> > that proactive compaction kicks in earlier.
> >
> > Signed-off-by: cgel <cgel@xxxxxxxxxx>
>
> > @@ -64,9 +91,11 @@ struct psi_group_cpu {
> >
> > /* Aggregate pressure state derived from the tasks */
> > u32 state_mask;
> > + u32 state_memstall;
> >
> > /* Period time sampling buckets for each state of interest (ns) */
> > u32 times[NR_PSI_STATES];
> > + u32 times_mem[PSI_MEM_STATES];
>
> This doubles the psi cache footprint on every context switch, wakeup,
> sleep, etc. in the scheduler. You're also adding more branches to
> those same paths. It'll measurably affect everybody who is using psi.
>
> Yet, in the years of using psi in production myself, I've never felt
> the need for what this patch provides. There are event counters for
> everything that contributes to pressure, and it's never been hard to
> rootcause spikes. There are also things like bpftrace that let you
> identify who is stalling for how long in order to do one-off tuning
> and systems introspection.
>
This patch is not meant for root-causing spikes. It is meant for automatic
memory tuning in addition to oomd, especially sysctl adjustment. For example,
if we see heavy direct reclaim pressure, the automatic tuning program might
raise watermark_scale_factor.

The basic idea is that this patch gives users a concise interface showing what
kind of memory pressure the system is suffering, so that the system can be
tuned at a finer grain. It could provide the data needed to adjust
watermark_boost_factor, extfrag_threshold, compaction_proactiveness,
transparent_hugepage/defrag, swappiness, vfs_cache_pressure, and madvise()
usage, which was not easy to do before.
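
As a rough illustration of the kind of automatic tuning program meant here,
the sketch below parses one line of the memory.pressure.stat format proposed
in this patch and maps it to a sysctl worth revisiting. The interface exists
only in this RFC, and the names parse_stat_line(), suggest_knob(),
struct mem_stall_stat, and the threshold parameter are hypothetical. The two
mappings (direct reclaim to watermark_scale_factor, kcompacted to
compaction_proactiveness) are just the examples mentioned in this thread, not
a recommended policy.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One line of the proposed memory.pressure.stat output (hypothetical type). */
struct mem_stall_stat {
	char name[32];               /* label before the colon, e.g. "direct reclaim" */
	double avg10, avg60, avg300; /* averages as printed by the interface */
	unsigned long long total;    /* cumulative stall counter as printed */
};

/*
 * Parse "<name>: avg10=X avg60=Y avg300=Z total=N".
 * Returns 0 on success, -1 if the line does not match that shape.
 */
static int parse_stat_line(const char *line, struct mem_stall_stat *out)
{
	assert(line && out);

	/* The label may contain spaces, so split on the first colon. */
	const char *colon = strchr(line, ':');
	if (!colon)
		return -1;

	size_t len = (size_t)(colon - line);
	if (len == 0 || len >= sizeof(out->name))
		return -1;
	memcpy(out->name, line, len);
	out->name[len] = '\0';

	if (sscanf(colon + 1, " avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   &out->avg10, &out->avg60, &out->avg300, &out->total) != 4)
		return -1;
	return 0;
}

/*
 * Assumed policy: if the 10s average is at or above `threshold`, name the
 * sysctl to look at for that stall source. Returns NULL when no
 * suggestion applies.
 */
static const char *suggest_knob(const struct mem_stall_stat *s, double threshold)
{
	if (s->avg10 < threshold)
		return NULL;
	if (strcmp(s->name, "direct reclaim") == 0)
		return "vm.watermark_scale_factor";
	if (strcmp(s->name, "kcompacted") == 0)
		return "vm.compaction_proactiveness";
	return NULL;
}
```

A tuning daemon would read the file periodically, call parse_stat_line() on
each line, and act on suggest_knob() only after the pressure has persisted
for some time, so that a short burst does not cause sysctl flapping.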

It is not easy for an automatic tuning program to use tools like bpftrace or
ftrace to do this.

We could use a CONFIG_PSI_XX option or a boot parameter to turn this feature
on or off, to avoid the additional footprint for users who do not need it.