Re: BUG_ON() in workingset_node_shadows_dec() triggers

From: Linus Torvalds
Date: Tue Oct 04 2016 - 23:29:07 EST


On Tue, Oct 4, 2016 at 7:43 PM, Paul Gortmaker
<paul.gortmaker@xxxxxxxxxxxxx> wrote:
>
> A couple years ago Ingo had an idea to kill BUG_ON abuse that I made
> a 1st pass at. Back then it seemed nobody cared. Maybe that has since
> changed?
>
> https://lkml.org/lkml/2014/4/30/359

So we actually have the checkpatch warning already:

# avoid BUG() or BUG_ON()
if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
my $msg_type = \&WARN;
$msg_type = \&CHK if ($file);
&{$msg_type}("AVOID_BUG",
"Avoid crashing the kernel - try
using WARN_ON & recovery code rather than BUG() or BUG_ON()\n" .
$herecurr);
}

but it doesn't trigger on VM_BUG_ON().

And I'm not convinced about replacing things with BUG_ON_AND_HALT(),
it simply doesn't fix the existing issue we have: people use BUG_ON(),
and worse, _when_ they use BUG_ON(), they use it instead of error
handling, so the code _around_ the BUG_ON() tends to then very much
depend on what the BUG_ON() checks.

This is actually one way that VM_BUG_ON() is better: it's very much by
design something that can be compiled away, so at least hopefully
nobody thinks of it as a security measure. So we could just say that
we will treat VM_BUG_ON() as a WARN_ON_ONCE(), and just not kill the
machine.

Because I could easily imagine that somebody *does* treat BUG_ON()
that way, thinking "well, if that BUG_ON() triggers at least it won't
then go off the rails later".

In fact, right now we mark BUG() in such a way that gcc can even take
advantage of the "crash the machine" semantics, because we explicitly
mark the BUG() with "unreachable()".

So what I think we should think about is:

- extending the checkpatch warning to VM_BUG_ON too, to discourage new users.

- look at making BUG_ON() simply be less lethal. Remove the
unrechable(), reorganize how the string is stored, and make it act
more like WARN_ON_ONCE() instead (it's the "rewind_stack_do_exit()"
that ends up causing us to try to kill things, we *could* just try to
stop doing that).

- Instead of adding a BUG_ON_AND_HALT(), we could perhaps add a new
FATAL_ERROR() thing that acts like the current BUG_ON, and *not* call
it something similar (we don't want people doing mindless
conversions!). And that's the one that would do the whole
rewind_stack_do_exit() to kill the process.

But apart from the checkpatch thing, it's actually a pretty big change.

Linus