Re: [patch] mm, debug: allow suppressing panic on CONFIG_DEBUG_VM checks

From: David Rientjes
Date: Tue May 23 2023 - 20:53:08 EST


On Mon, 22 May 2023, Linus Torvalds wrote:

> On Mon, May 22, 2023 at 5:52 PM David Rientjes <rientjes@xxxxxxxxxx> wrote:
> >
> > Right now kernel.panic_on_warn can either be 0 or 1. We can keep the
> > lowest bit to be "panic on all warnings" and then bit-1 as "panic on debug
> > VM warnings." When CONFIG_DEBUG_VM is enabled, set the new bit by
> > default so there's no behavior change.
>
> So right now CONFIG_DEBUG_VM being off means that there's nothing at
> all - not just no output, but also no code generation.
>
> I don't think CONFIG_DEBUG_VM in itself should enable that bit-1 behavior.
>
> That may be what *you* as a VM person wants, but VM people are not
> exactly the common case.
>
> So I think we've got several cases:
>
> (a) the "don't even build it" case (CONFIG_DEBUG_VM being off)
>
> (b) the "build it, and it is a WARN_ON_ONCE()" case
>
> (c) the *normal* "panic_on_warn=1" case, which by default would panic
> on all warnings, including any warnings from CONFIG_DEBUG_VM
>
> (d) the "VM person" case, which might not panic on normal warnings,
> but would panic on the VM warnings.
>
> and I think the use-cases are for different classes of kernel use:
>
> (a) is for people who disable debugging code until they feel it is
> needed (which I think covers a lot of kernel developers - I certainly
> personally tend to not build with debug support unless I'm chasing
> some issue down)
>
> (b) would probably be most distros - enable the warning so that the
> distro can report it, but try not to kill the machine of random people
>
> (c) would be most cloud use cases, presumably together with reboot-on-panic
>
> (d) would be people who are actual VM developers, and basically want
> the *current* behavior of VM_BUG_ON() with a machine that stops
>
> and I think (d) is the smallest set of cases of all, but is the one
> you're personally interested in.
>

If we want to change the behavior from today toward something that we
think is the more common case for enabling CONFIG_DEBUG_VM, that works
too. If we fully remove VM_BUG_ON() in favor of VM_WARN_ON() +
kernel.panic_on_warn=1, then anybody relying on getting kernel panics when
they qualify new kernels with CONFIG_DEBUG_VM will start getting WARNINGs
but not panics unless they are already using kernel.panic_on_warn.

I think that's fine, but is a change in behavior. My use cases work both
ways. If we don't set the bit-1 behavior by default on CONFIG_DEBUG_VM
then I just won't need to clear it.

I'm personally interested in (d) for debugging issues, but the intent of
this patch was actually to allow for (b) too. I want to deploy
CONFIG_DEBUG_VM with WARN_ON_ONCE() behavior to a small set of production
machines to catch latent kernel issues we don't know about, but without
impacting the workloads.

That's also very valuable because I want to surface CONFIG_DEBUG_VM checks
that would never get hit because we panic before it can be, just because
of some other VM_BUG_ON(). Your idea of WARN_ON_ONCE() will be great for
that because we can make forward progress and not be too spammy to the
kernel log.

There seems to be some agreement in the thread for removing VM_BUG_ON()
and friends in favor of VM_WARN_ON(), so I'll wait a couple days for
anymore feedback and then send a patch along.

This seems like it will be very clean and allow for (b), which is great.