Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads

From: Yosry Ahmed
Date: Tue Aug 22 2023 - 11:30:48 EST


On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> > Unified flushing allows for great concurrency for paths that attempt to
> > flush the stats, at the expense of potential staleness and a single
> > flusher paying the extra cost of flushing the full tree.
> >
> > This tradeoff makes sense for in-kernel flushers that may observe high
> > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > may be unexpected and problematic, especially when such stats are used
> > for critical paths such as userspace OOM handling. Additionally, a
> > userspace reader will occasionally pay the cost of flushing the entire
> > hierarchy, which also causes problems in some cases [1].
> >
> > Opt userspace reads out of unified flushing. This makes the cost of
> > reading the stats more predictable (proportional to the size of the
> > subtree), as well as the freshness of the stats. Since userspace readers
> > are not expected to have similar concurrency to in-kernel flushers,
> > serializing them among themselves and among in-kernel flushers should be
> > okay.
> >
> > This was tested on a machine with 256 cpus by running a synthetic test
> > script that creates 50 top-level cgroups, each with 5 children (250
> > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > unified flusher). Concurrently, one thread is spawned per-cgroup to read
> > the stats every second (including root, top-level, and leaf cgroups --
> > so total 251 threads). No regressions were observed in the total running
> > time, which means that non-unified userspace readers are not slowing
> > down in-kernel unified flushers:
>
> I have to admit I am rather confused by cgroup_rstat_flush (and
> cgroup_rstat_flush_locked). The former says it can block but the latter
> doesn't ever block and even if it drops the cgroup_rstat_lock it merely
> cond_rescheds or busy loops. How much of a contention and yielding can
> you see with this patch? What is the worst case? How bad a random user
> can make the situation by going crazy and trying to flush from many
> different contexts?

Userspace readers (or more generically non-unified flushers) can all
collectively only block a single unified flusher at most.
Specifically, one userspace reader goes to flush and holds
cgroup_rstat_lock; meanwhile an in-kernel flusher (e.g. reclaim) tries
to flush and spins on cgroup_rstat_lock. Other in-kernel (unified)
flushers will just see another unified flush in progress and skip. So
userspace can only block a single in-kernel reclaimer.
Note that it's not really that bad because:
(a) As you note, cgroup_rstat_flush() does not really "block", it's
cpu-bound. Even when it cond_resched()'s, it yields the lock first. So
it can't really hold anyone hostage for long.
(b) I assume a random user can only read their own stats, which should
be a relatively small subtree, quick to flush. I am assuming a random
user cannot read root's memory.stat (which is most expensive).
(c) Excessive flushing doesn't really build up because there will be
nothing to flush and the lock will be released very shortly after it's
held.

So to answer your question, I don't think a random user can really
affect the system in a significant way by constantly flushing. In
fact, in the test script (which I am now attaching, in case you're
interested), there are hundreds of threads that are reading stats of
different cgroups every 1s, and I don't see any negative effects on
in-kernel flushers in this case (reclaimers).

> --
> Michal Hocko
> SUSE Labs
#!/bin/bash

NR_L1_CGROUPS=50
NR_L2_CGROUPS_PER_L1=5
NR_L2_CGROUPS=$(( NR_L1_CGROUPS * NR_L2_CGROUPS_PER_L1 ))
NR_WORKERS_PER_CGROUP=10
WORKER_MB=10
NR_TOTAL_WORKERS=$(( NR_WORKERS_PER_CGROUP * (1 + NR_L1_CGROUPS + NR_L2_CGROUPS) ))
L1_CGROUP_LIMIT_MB=$(( NR_L2_CGROUPS_PER_L1 * NR_WORKERS_PER_CGROUP * WORKER_MB / 4 ))
TOTAL_MB=$(( NR_TOTAL_WORKERS * WORKER_MB ))
TMPFS=$(mktemp -d)
ROOT="/sys/fs/cgroup/"
ADMIN="/sys/fs/cgroup/admin"
ZRAM_DEV="/mnt/devtmpfs/zram0"

cleanup_cgroup() {
    local -r cg=$1
    local -r procs=$(cat $cg/cgroup.procs)
    if [[ -n $procs ]]; then
        kill -KILL $procs
        wait $procs 2>/dev/null
    fi
    while [[ -n $(cat $cg/cgroup.procs) ]]; do
        sleep 0.1
    done
    rmdir $cg
}

cleanup() {
    for i in $(seq $NR_L1_CGROUPS); do
        l1="$ROOT/parent$i"
        for j in $(seq $NR_L2_CGROUPS_PER_L1); do
            l2="$l1/child$j"
            cleanup_cgroup $l2
        done
        cleanup_cgroup $l1
    done

    cleanup_cgroup $ADMIN

    umount $TMPFS
    rm -rf $TMPFS

    swapoff $ZRAM_DEV
    echo 1 > "/sys/block/zram0/reset"
}
trap cleanup INT QUIT EXIT

run_stats_reader() {
    local -r cg_run=$1
    local -r cg_read=$2

    # read the stats every second until workers are done
    echo 0 > "$cg_run/cgroup.procs"
    while [[ $(ls $TMPFS) ]]; do
        cat "$cg_read/memory.stat" > /dev/null
        sleep 1
    done
}

run_worker() {
    local -r cg=$1
    local -r f=$2

    echo 0 > "$cg/cgroup.procs"
    rm -rf "$f"
    dd if=/dev/zero of="$f" bs=1M count=$WORKER_MB status=none
    cat "$f" > /dev/null
    rm "$f"
}

# Setup zram
echo $((TOTAL_MB << 20)) > "/sys/block/zram0/disksize"
mkswap $ZRAM_DEV
swapon $ZRAM_DEV
echo "Setup zram done"

# Mount tmpfs
mount -t tmpfs none $TMPFS

# Create admin cgroup
mkdir $ADMIN

# Create worker cgroups, set limits
echo "+memory" > "$ROOT/cgroup.subtree_control"
for i in $(seq $NR_L1_CGROUPS); do
    l1="$ROOT/parent$i"
    mkdir $l1
    echo "+memory" > "$l1/cgroup.subtree_control"
    for j in $(seq $NR_L2_CGROUPS_PER_L1); do
        l2="$l1/child$j"
        mkdir $l2
    done
    echo $(( L1_CGROUP_LIMIT_MB << 20 )) > "$l1/memory.max"
done
echo "Setup $NR_L1_CGROUPS L1 cgroups with limit ${L1_CGROUP_LIMIT_MB}M done"
echo "Each L1 cgroup has $NR_L2_CGROUPS_PER_L1 L2 children"

# Start workers to allocate tmpfs memory
for i in $(seq $NR_L1_CGROUPS); do
    l1="$ROOT/parent$i"
    for j in $(seq $NR_L2_CGROUPS_PER_L1); do
        l2="$l1/child$j"
        for k in $(seq $NR_WORKERS_PER_CGROUP); do
            (run_worker "$l2" "$TMPFS/$i-$j-$k") &
        done
    done
done

# Start stat readers
(run_stats_reader "$ADMIN" "$ROOT")&
for i in $(seq $NR_L1_CGROUPS); do
    l1="$ROOT/parent$i"
    (run_stats_reader "$ADMIN" "$l1") &
    for j in $(seq $NR_L2_CGROUPS_PER_L1); do
        l2="$l1/child$j"
        # Run stat readers in the admin cgroup as well as the cgroup itself
        (run_stats_reader "$ADMIN" "$l2") &
        (run_stats_reader "$l2" "$l2") &
    done
done

# Wait for workers
echo "Ran $NR_WORKERS_PER_CGROUP workers per L2 cgroup, each allocating ${WORKER_MB}M, waiting..."
wait