Re: Crashes with 874bbfe600a6 in 3.18.25

From: Mike Galbraith
Date: Tue Feb 09 2016 - 10:31:37 EST


On Fri, 2016-02-05 at 16:06 -0500, Tejun Heo wrote:
> On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote:
> > On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote:
> >
> > > What are you suggesting?
> >
> > That 874bbfe6 should die.
>
> Yeah, it's gonna be killed. The commit is there because the behavior
> change broke things. We don't want to guarantee it but have been and
> can't change it right away just because we don't like it when things
> may break from it. The plan is to implement a debug option to force
> workqueue to always execute these work items on a foreign cpu to weed
> out breakages.

A niggling question remains: when is it gonna be killed?

1. Meanwhile, 874bbfe6 was sent to 2.6.31+, meaning that every stable
tree where it landed which did not ALSO receive 22b886dd has become
destabilized. We have two 3.12-stability reports, one the hotplug
explosion that you provided a workaround for, one the corruption, and
one corruption report for 3.18. Both breakage types would be sort of
fixed up if 22b886dd and your hotplug workaround (which does _not_
guarantee survival) were applied everywhere, however...

2. We also have a report for the 3.18 corruption victim that adding
22b886dd did NOT restore the stable status quo, rather it replaced the
corruption that 874bbfe6 caused with a performance regression.

3. 874bbfe6 + 22b886dd also inflicts a NO_HZ_FULL regression.
Admittedly not a huge deal, but another regression nonetheless.

The only evidence I've seen that anything at all was broken by the
changes that triggered the inception of 874bbfe6 in the first place was
the b0rked vmstat thing that Linus had already fixed with 176bed1d. So
where is the breakage you mention that makes keeping 874bbfe6 the
prudent thing to do vs just reverting 874bbfe6 immediately, perhaps
22b886dd as well given it is fallout thereof, and getting that sent off
to stable?

It looks for all the world as if the sole excuse for either to exist is
to prevent any other stupid mistakes like the vmstat thing from being
exposed for what they are by actively hiding them, when in fact, that
hiding doesn't survive a hotplug event (as we saw in the crash analysis
I showed you). Surely there's a better reason to keep that commit than
hiding bugs that can only remain hidden until they meet hotplug. What
is it?

-Mike