Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline

From: Paul E. McKenney
Date: Thu Jun 28 2018 - 08:36:38 EST


On Thu, Jun 28, 2018 at 10:26:53AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 27, 2018 at 10:13:34PM -0700, Paul E. McKenney wrote:
> > On Wed, Jun 27, 2018 at 07:51:34PM +0200, Peter Zijlstra wrote:
> > > On Wed, Jun 27, 2018 at 08:57:21AM -0700, Paul E. McKenney wrote:
> > > > > Another variant, which simply skips the wakeup whever ran on an offline
> > > > > CPU, relying on the wakeup from rcutree_migrate_callbacks() right after
> > > > > the CPU really is dead.
> > > >
> > > > Cute! ;-)
> > > >
> > > > And a much smaller change.
> > > >
> > > > However, this means that if someone indirectly and erroneously causes
> > > > rcu_report_qs_rsp() to be invoked from an offline CPU, the result is an
> > > > intermittent and difficult-to-debug grace-period hang. A lockdep splat
> > > > whose stack trace directly implicates the culprit is much better.
> > >
> > > How so? We do an unconditional wakeup right after finding the offline
> > > cpu dead. There is only very limited code between offline being true and
> > > the CPU reporting in dead.
> >
> > I am thinking more generally than this particular patch. People
> > sometimes invoke things from places they shouldn't, for example, the
> > situation leading to your patch that allows use of RCU much earlier in
> > the CPU-online process. It is nicer to get a splat in those situations
> > than a silent hang.
>
> The rcu_rnp_online_cpus() thing would catch that, right? The public RCU
> API isn't that big, and should already complain afaict.

Please let me try again.

The approach you are suggesting, clever though it is, disables a check
of a type that has proved to be an important diagnostic in the past.
It is only reasonable to assume that this check would be important
and helpful in the future, but only if that check remains in the code.
Yes, agreed, given the current structure of the code, this particular
instance of the check would not matter, but experience indicates that
RCU code restructuring is not at all uncommon, with the current effort
being but one case in point.

So, unless I am missing something, the only possible benefit of disabling
this check is getting rid of an acquisition of an uncontended lock in
a code path that is miles (sorry, kilometers) away from any fastpath.
So, again, yes, it is clever. If it sped up a fastpath, I might be
sorely tempted to take it. But the alternative is straightforward and
isn't anywhere near a fastpath. So, though I do very much appreciate
the cleverness and creativity, I am not seeing your change to be a
good tradeoff from a long-term maintainability viewpoint.

Am I missing something here?

Thanx, Paul