Re: [PATCH] psi: fix PSI_MEM_FULL state when tasks are in memstall and doing reclaim

From: Peter Zijlstra
Date: Sat Nov 13 2021 - 02:44:12 EST


On Fri, Nov 12, 2021 at 11:53:20AM -0500, Johannes Weiner wrote:
> On Wed, Nov 10, 2021 at 09:33:12PM +0000, Brian Chen wrote:
> > We've noticed cases where tasks in a cgroup are stalled on memory but
> > there is little memory FULL pressure since tasks stay on the runqueue
> > in reclaim.
> >
> > A simple example involves a single threaded program that keeps leaking
> > and touching large amounts of memory. It runs in a cgroup with swap
> > enabled, memory.high set at 10M and cpu.max ratio set at 5%. Though
> > there is significant CPU pressure and memory SOME, there is barely any
> > memory FULL since the task enters reclaim and stays on the runqueue.
> > However, this memory-bound task is effectively stalled on memory and
> > we expect memory FULL to match memory SOME in this scenario.
> >
> > The code is confused about memstall && running, thinking there is a
> > stalled task and a productive task when there's only one task: a
> > reclaimer that's counted as both. To fix this, we redefine the
> > condition for PSI_MEM_FULL to check that all running tasks are in an
> > active memstall instead of checking that there are no running tasks.
> >
> >  	case PSI_MEM_FULL:
> > -		return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
> > +		return unlikely(tasks[NR_MEMSTALL] &&
> > +			tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
> >
> > This will capture reclaimers. It will also capture tasks that called
> > psi_memstall_enter() and are about to sleep, but this should be
> > negligible noise.
> >
> > Signed-off-by: Brian Chen <brianchen118@xxxxxxxxx>
>
> Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
>
> This bug essentially causes us to count memory-some in walltime and
> memory-full in tasktime, which can be quite confusing and misleading
> in combined CPU and memory pressure situations.
>
> The fix looks good to me, thanks Brian.
>
> The bug's been there since the initial psi commit, so I don't think a
> stable backport is warranted.
>
> Peter, absent objections, can you please pick this up through -tip?

Yep can do. Note that our psi_group_cpu data structure is now completely
filled (the extra tasks state filled the last hole):

struct psi_group_cpu {
	seqcount_t      seq __attribute__((__aligned__(64)));              /*     0     4 */
	unsigned int    tasks[5];                                          /*     4    20 */
	u32             state_mask;                                        /*    24     4 */
	u32             times[7];                                          /*    28    28 */
	u64             state_start;                                       /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	u32             times_prev[2][7] __attribute__((__aligned__(64))); /*    64    56 */

	/* size: 128, cachelines: 2, members: 6 */
	/* padding: 8 */
	/* forced alignments: 2 */
} __attribute__((__aligned__(64)));
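
For anyone following along, here is a rough userspace sketch (not the kernel
code) of the two things discussed in this thread: the old versus new
PSI_MEM_FULL condition, and why a fifth tasks[] counter exactly fills the
first cacheline. Several things in it are assumptions made for illustration
only: the ordering of the enum and its NR_IOWAIT/NR_ONCPU entries (the thread
only names NR_MEMSTALL, NR_RUNNING and NR_MEMSTALL_RUNNING), the 4-byte
seqcount_t stand-in, and the helper names mem_full_old(), mem_full_new() and
psi_sketch_selfcheck(). It relies on GCC/Clang attributes and C11
_Static_assert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * ASSUMPTION: entry order and the NR_IOWAIT/NR_ONCPU names are illustrative.
 * The thread only confirms three of the names and that tasks[] has 5 slots.
 */
enum psi_task_count {
	NR_IOWAIT,
	NR_MEMSTALL,
	NR_RUNNING,
	NR_ONCPU,
	NR_MEMSTALL_RUNNING,	/* the counter added by the patch */
	NR_PSI_TASK_COUNTS,	/* == 5 */
};

/* ASSUMPTION: userspace stand-in for seqcount_t, 4 bytes as in the dump. */
typedef struct { unsigned int sequence; } seqcount_t;

/* Mirror of the layout quoted above, with u32/u64 spelled as stdint types. */
struct psi_group_cpu {
	seqcount_t seq __attribute__((__aligned__(64)));
	unsigned int tasks[NR_PSI_TASK_COUNTS];
	uint32_t state_mask;
	uint32_t times[7];
	uint64_t state_start;
	uint32_t times_prev[2][7] __attribute__((__aligned__(64)));
} __attribute__((__aligned__(64)));

/* Offsets and sizes from the dump: no hole is left before state_start. */
_Static_assert(offsetof(struct psi_group_cpu, tasks) == 4, "tasks");
_Static_assert(offsetof(struct psi_group_cpu, state_mask) == 24, "state_mask");
_Static_assert(offsetof(struct psi_group_cpu, times) == 28, "times");
_Static_assert(offsetof(struct psi_group_cpu, state_start) == 56, "state_start");
_Static_assert(offsetof(struct psi_group_cpu, times_prev) == 64, "times_prev");
_Static_assert(sizeof(struct psi_group_cpu) == 128, "two cachelines");

/* Old condition: FULL only if something is stalled and nothing is runnable. */
static int mem_full_old(const unsigned int *tasks)
{
	return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
}

/* New condition: FULL if every runnable task is itself in a memstall. */
static int mem_full_new(const unsigned int *tasks)
{
	return tasks[NR_MEMSTALL] &&
	       tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING];
}

/* Hypothetical self-check covering the reclaimer case from the changelog. */
static void psi_sketch_selfcheck(void)
{
	/* One task, runnable and in reclaim: counted as stalled and running. */
	unsigned int reclaimer[NR_PSI_TASK_COUNTS] = { 0 };

	reclaimer[NR_MEMSTALL] = 1;
	reclaimer[NR_RUNNING] = 1;
	reclaimer[NR_MEMSTALL_RUNNING] = 1;

	assert(!mem_full_old(reclaimer));	/* missed by the old check */
	assert(mem_full_new(reclaimer));	/* caught by the new check */
}
```

When no task is runnable, both counters compared by the new check are zero,
so it reduces to the old one; a runnable task that is not in a memstall still
keeps FULL clear in both versions.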