Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration

From: Roman Gushchin
Date: Mon Jun 05 2023 - 22:57:15 EST


On Tue, Jun 06, 2023 at 11:24:32AM +1000, Dave Chinner wrote:
> On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote:
> > On Mon, Jun 05, 2023 at 10:03:25PM +0300, Kirill Tkhai wrote:
> > > The kernel test robot reports a -88.8% regression in the
> > > stress-ng.ramfs.ops_per_sec test case, caused by commit f95bdb700bc6
> > > ("mm: vmscan: make global slab shrink lockless"). Qi Zheng found that
> > > the cause is the long synchronize_srcu() call occurring in
> > > unregister_shrinker().
> > >
> > > This patch fixes the problem by using new unregistration interfaces,
> > > which split unregister_shrinker() into two parts. The first part only
> > > notifies the shrinker subsystem of the unregistration and prevents
> > > future calls to the shrinker's methods. The second part completes the
> > > unregistration and ensures that the struct shrinker is no longer used
> > > during shrinker chain iteration, so its memory may be freed. Since the
> > > long second part is called asynchronously from delayed work, the
> > > synchronize_srcu() delay is hidden from the user.
> > >
> > > Signed-off-by: Kirill Tkhai <tkhai@xxxxx>
> > > ---
> > > fs/super.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/super.c b/fs/super.c
> > > index 8d8d68799b34..f3e4f205ec79 100644
> > > --- a/fs/super.c
> > > +++ b/fs/super.c
> > > @@ -159,6 +159,7 @@ static void destroy_super_work(struct work_struct *work)
> > >  						   destroy_work);
> > >  	int i;
> > >  
> > > +	unregister_shrinker_delayed_finalize(&s->s_shrink);
> > >  	for (i = 0; i < SB_FREEZE_LEVELS; i++)
> > >  		percpu_free_rwsem(&s->s_writers.rw_sem[i]);
> > >  	kfree(s);
> > > @@ -327,7 +328,7 @@ void deactivate_locked_super(struct super_block *s)
> > >  {
> > >  	struct file_system_type *fs = s->s_type;
> > >  	if (atomic_dec_and_test(&s->s_active)) {
> > > -		unregister_shrinker(&s->s_shrink);
> > > +		unregister_shrinker_delayed_initiate(&s->s_shrink);
> >
> > Hm, it makes the API more complex and easier to misuse. Like what
> > happens if the second part is never called? Or if it's called without
> > the first part being called first?
>
> Bad things.

Agree.
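
To make the hazard concrete, here is a minimal sketch of the split as I
read it from the patch (the signatures and semantics below are my
assumptions, not taken verbatim from the series):

/* Part 1 (fast): unhook the shrinker so that no new ->count_objects()
 * or ->scan_objects() calls can start. An in-flight SRCU iteration of
 * the shrinker list may still reference the struct shrinker.
 */
void unregister_shrinker_delayed_initiate(struct shrinker *shrinker);

/* Part 2 (slow): wait, via synchronize_srcu(), until no list iterator
 * can still see the shrinker. Only after this returns may the memory
 * embedding the struct shrinker be freed.
 */
void unregister_shrinker_delayed_finalize(struct shrinker *shrinker);

Nothing ties the two calls together, so a caller that skips the
finalize step frees memory that a concurrent shrink_slab() may still
dereference, and a caller that skips the initiate step leaves the
shrinker callbacks runnable against an object being torn down.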

> Also, it doesn't fix the three other unregister_shrinker() calls in
> the XFS unmount path, nor the three in the ext4/mbcache/jbd2 unmount
> path.
>
> Those are just some of the unregister_shrinker() calls that have
> dynamic contexts that would also need this same fix; I haven't
> audited the 3 dozen other unregister_shrinker() calls around the
> kernel to determine if any of them need similar treatment, too.
>
> IOWs, this patchset is purely a band-aid to fix the reported
> regression, not an actual fix for the underlying problems caused by
> moving the shrinker infrastructure to SRCU protection. This is why
> I really want the SRCU changeover reverted.
>
> Not only are significant changes to the API necessary, it also puts
> the entire shrinker paths under an SRCU critical section. AIUI, this
> means that while the shrinkers are running the RCU grace period
> cannot expire and no RCU-freed memory will actually get freed until
> the SRCU read lock is dropped by the shrinker.
>
> Given that the superblock shrinkers free dentry and inode objects via
> RCU, this is also a fairly significant change of behaviour: a
> cond_resched() in the shrinker processing loops no longer allows RCU
> grace periods to expire and memory to be freed while the shrinkers
> are running.
>
> Are there problems this will cause? I don't know, but I'm pretty
> sure they haven't even been considered until now....
>
> > Isn't it possible to hide it from the user and call the second part
> > from a work context automatically?
>
> Nope, because it has to be done before the struct shrinker is freed.
> Those are embedded into other structures rather than being
> dynamically allocated objects.

This part we might consider revisiting, if it helps to solve other
problems. An extra memory allocation (or two) per mount point doesn't
look that expensive. Again, iff it helps with more important problems.
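
For the record, a minimal sketch of what that could look like (all
names here are hypothetical; it assumes the two-step interface from
this series, with the slow step moved into a workqueue):

struct sb_shrinker {
	struct shrinker shrinker;
	struct work_struct free_work;
};

static void sb_shrinker_free_work(struct work_struct *work)
{
	struct sb_shrinker *ss = container_of(work, struct sb_shrinker,
					      free_work);

	/* Slow part: waits out SRCU readers; afterwards the memory is
	 * safe to free. The unmounting task never sees this delay.
	 */
	unregister_shrinker_delayed_finalize(&ss->shrinker);
	kfree(ss);
}

/* Fast part, called from the unmount path: unhook the shrinker, then
 * defer the grace-period wait to a workqueue.
 */
static void sb_shrinker_unregister(struct sb_shrinker *ss)
{
	unregister_shrinker_delayed_initiate(&ss->shrinker);
	INIT_WORK(&ss->free_work, sb_shrinker_free_work);
	schedule_work(&ss->free_work);
}

Pairing both steps inside one helper would also answer the misuse
concern above: callers could no longer invoke them in the wrong order
or forget the second half.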

Thanks!