Re: [RFC PATCH] xfs: fix per-cpu CIL structure aggregation racing with dying cpus

From: Dave Chinner
Date: Tue Aug 22 2023 - 21:44:56 EST


On Tue, Aug 22, 2023 at 06:20:11PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 23, 2023 at 09:28:16AM +1000, Dave Chinner wrote:
> > On Tue, Aug 22, 2023 at 11:30:16AM -0700, Darrick J. Wong wrote:
> > > On Mon, Aug 21, 2023 at 03:17:52PM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 04, 2023 at 03:38:54PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > > >
> > > > > In commit 7c8ade2121200 ("xfs: implement percpu cil space used
> > > > > calculation"), the XFS committed (log) item list code was converted to
> > > > > use per-cpu lists and space tracking to reduce cpu contention when
> > > > > multiple threads are modifying different parts of the filesystem and
> > > > > hence end up contending on the log structures during transaction commit.
> > > > > Each CPU tracks its own commit items and space usage, and these do not
> > > > > have to be merged into the main CIL until either someone wants to push
> > > > > the CIL items, or we run over a soft threshold and switch to slower (but
> > > > > more accurate) accounting with atomics.
> > > > >
> > > > > Unfortunately, the for_each_cpu iteration suffers from the same race
> > > > > with cpu dying problem that was identified in commit 8b57b11cca88f
> > > > > ("pcpcntrs: fix dying cpu summation race") -- CPUs are removed from
> > > > > cpu_online_mask before the CPUHP_XFS_DEAD callback gets called. As a
> > > > > result, both CIL percpu structure aggregation functions fail to collect
> > > > > the items and accounted space usage at the correct point in time.
> > > > >
> > > > > If we're lucky, the items that are collected from the online cpus exceed
> > > > > the space given to those cpus, and the log immediately shuts down in
> > > > > xlog_cil_insert_items due to the (apparent) log reservation overrun.
> > > > > This happens periodically with generic/650, which exercises cpu hotplug
> > > > > vs. the filesystem code.
> > > > >
> > > > > Applying the same sort of fix from 8b57b11cca88f to the CIL code seems
> > > > > to make the generic/650 problem go away, but I've been told that tglx
> > > > > was not happy when he saw:
> > > > >
> > > > > "...the only thing we actually need to care about is that
> > > > > percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when
> > > > > there are no CPUs dying, it has no addition overhead except for a
> > > > > cpumask_or() operation."
> > > > >
> > > > > I have no idea what the /correct/ solution is, but I've been holding on
> > > > > to this patch for 3 months. In that time, 8b57b11cca88 hasn't been
> > > > > reverted and cpu_dying_mask hasn't been removed, so I'm sending this and
> > > > > we'll see what happens.
> > > > >
> > > > > So, how /do/ we perform periodic aggregation of per-cpu data safely?
> > > > > Move the xlog_cil_pcp_dead call to the dying section, where at least the
> > > > > other cpus will all be stopped? And have it dump its items into any
> > > > > online cpu's data structure?
> > > >
> > > > I suspect that we have to stop using for_each_*cpu() and hotplug
> > > > notifiers altogether for this code.
> > > >
> > > > That is, we simply create our own bitmap for nr_possible_cpus in the
> > > > CIL checkpoint context we allocate for each checkpoint (i.e. the
> > > > struct xfs_cil_ctx). When we store something on that CPU for that
> > > > CIL context, we set the corresponding bit for that CPU. Then when we
> > > > are aggregating the checkpoint, we simply walk all the cpus with the
> > > > "has items" bit set and grab everything.
> > > >
> > > > This gets rid of the need for hotplug notifiers completely - we just
> > > > don't care if the CPU is online or offline when we sweep the CIL
> > > > context for items at push time - if the bit is set then there's
> > > > stuff to sweep...
> > >
> > > Oooh, good idea. Something like this, then? Lightly tested via
> > > generic/650 for five minutes...
> >
> > Not quite what I was saying - you put the cpu mask in the struct
> > xfs_cil, which means it can potentially be accessed from both commit
> > and aggregation at the same time for different checkpoints.
> >
> > What I was suggesting is that it gets placed in the struct
> > xfs_cil_ctx which is guaranteed to be private to
>
> private to...?

Sorry, I thought I removed that chunk from my reply - ignore it.

....
> > .....
> >
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index 09638e8fb4ee..ef7775657ce3 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -2313,7 +2313,6 @@ xfs_cpu_dead(
> > > list_for_each_entry_safe(mp, n, &xfs_mount_list, m_mount_list) {
> > > spin_unlock(&xfs_mount_list_lock);
> > > xfs_inodegc_cpu_dead(mp, cpu);
> > > - xlog_cil_pcp_dead(mp->m_log, cpu);
> > > spin_lock(&xfs_mount_list_lock);
> > > }
> > > spin_unlock(&xfs_mount_list_lock);
> >
> > I wonder if we can do something similar for the inodegc code, and
> > once again get rid of the need for hotpulg notifiers in XFS?
>
> I would imagine so, if you don't mind bloating up xfs_mount.

The xfs_mount is already huge... :)

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx