Re: [PATCH] xfs: idle aild if the AIL is pushed up to the target LSN

From: Dave Chinner
Date: Wed Apr 27 2016 - 18:30:00 EST


On Wed, Apr 27, 2016 at 08:31:38PM +0200, Lucas Stach wrote:
> Am Dienstag, den 26.04.2016, 09:08 +1000 schrieb Dave Chinner:
> [...]
> > >
> > > >
> > > > That said, I'm not sure whether there's a notable benefit of
> > > > idling
> > > > for
> > > > 50ms over just scheduling out when we've hit the target lsn. It
> > > > seems
> > > > like that anybody who pushes the target forward again is going to
> > > > wake
> > > > up the thread anyways. On the other hand, if the fs is idle the
> > > > thread
> > > > will eventually schedule out indefinitely. 
> > > Is this a problem? The patch tries to do exactly that: schedule out
> > > aild indefinitely when there is no more work to do as nobody is
> > > pushing
> > > the target LSN forward.
> > If the filesystem is slowly being dirtied, then the aild should't
> > really idle at all.i
> >
> > Keep in mind that the xfsaild has multiple functions, one of which
> > is a watchdog that catches log space stalls that would otherwise
> > hang the filesystem. Every time we've removed the watchdog function
> > (i.e.  agressively idle the aild) we've had users report random,
> > unreproducable hangs/stalls that have gone away when the watchdog
> > function (i.e. don't idle until the log is covered and completely
> > idle) was re-instated...
> >
> I can only see xfsaild_push() doing any work after it has hit the
> target LSN if something moves the target LSN forward. You say that
> aggressively idling aild might produce log stalls, which would imply
> there are races in the code where a code path that moves the target LSN
> forward doesn't properly wake up aild.

Well, yes. The code is horrifically complex, there's a heap of
lockless operations along with cross-subsystem co-ordinated
operations done under different locks amongst other things to
provide scalability. History tells me that no matter whether we
*think* we've got it right, there's always another bug lurking.

> Wouldn't this problem also be present when doing non-aggressive idle of
> aild, just the probability of hitting the issue being reduced
> significantly?

Welcome to Risk Management 101.

I've got better things to do with my time than remove a safety net
and the be forced to spend days or even weeks trying to solve all
the subtle, deeply hidden problems that have been around for 20
years that are now exposed to users. That's not a productive use of
the limited amount of XFS developer's time we have available. If you
want to go ahead and do all this, I'll be happy to spend a year
teaching you about how all the log space reservation code works...

> The commit that re-enabled non-aggressive aild idle
> especially mentions some races that have been fixed and I think those
> fixes should allow for agressive aild idle. If they are insufficient it
> wouldn't be safe to idle aild at all, right?

No - the log state machine that covers (idles) the log is the one we
really care about and it guarantees that the AIL is empty. i.e. The
AIL has active items in it until the log is covered and hence, by
definition, it can't be idle until the log is covered.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx