Re: [PATCH] sched/all: Change BUG_ON() instances to WARN_ON()

From: John Hubbard
Date: Sat Aug 13 2022 - 21:10:43 EST


On 8/11/22 16:22, Jason Gunthorpe wrote:
On Thu, Aug 11, 2022 at 10:28:27PM +0100, Matthew Wilcox wrote:
On Thu, Aug 11, 2022 at 01:43:09PM -0700, Linus Torvalds wrote:
May I suggest going one step further, and making these WARN_ON_ONCE() instead.

>From personal experience, once some scheduler bug (or task struct
corruption) happens, ti often *keeps* happening, and the logs just
fill up with more and more data, to the point where you lose sight of
the original report (and the machine can even get unusable just from
the logging).

I've been thinking about magically turning all the WARN_ON_ONCE() into
(effectively) WARN_ON_RATELIMIT(). I had some patches in that direction
a while ago but never got round to tidying them up for submission.

If you do that, I'd like to suggest that you avoid using magic here, but
instead just rename at the call sites.

Because:

First and foremost, something named WARN_ON_ONCE() clearly has a solemn
responsibility to warn exactly "once times"! :)

Second, it's not yet clear (or is it?) that WARN_ON_ONCE() is always
worse than rate limiting. It's a trade-off, rather than a clear win for
either case, in my experience. The _ONCE variant can get overwritten
if the kernel log wraps, but the _RATELIMIT on the other hand, may be
excessive.

And finally, if it *is* agreed on here that WARN_ON_RATELIMIT() is
always better than WARN_ON_ONCE(), then there is still no harm in
spending a patch or two (coccinelle...) to rename WARN_ON_ONCE() -->
WARN_ON_RATELIMIT(), so that we end up with accurate names.


I often wonder if we have a justification for WARN_ON to even exist, I
see a lot of pressure to make things into WARN_ON_ONCE based on the
logic that spamming makes it useless..

Agreed. WARN_ON_ONCE() or WARN_ON_RATELIMIT(), take your pick. But not
WARN_ON_EVERY_TIME()--that usually causes a serious problems in the
logs.


thanks,
--
John Hubbard
NVIDIA