Re: [PATCH 0/3] clear_warn_once: add timed interval resetting

From: Petr Mladek
Date: Wed Dec 09 2020 - 11:38:02 EST


On Thu 2020-11-26 01:30:26, Paul Gortmaker wrote:
> The existing clear_warn_once functionality is currently a manually
> issued state reset via the file /sys/kernel/debug/clear_warn_once when
> debugfs is mounted. The idea being that a developer would be running
> some tests, like LTP or similar, and want to check reproducibility
> without having to reboot.
>
> But you currently can't make use of clear_warn_once unless you've got
> debugfs enabled and mounted - which may not be desired by some people
> in some deployment situations.
>
> The functionality added here allows for periodic resets in addition to
> the one-shot reset it already had. Then we allow for a boot-time setting
> of the periodic resets so it can be used even when debugfs isn't mounted.
>
> By having a periodic reset, we also open the door for having the various
> "once" functions act as long period ratelimited messages, where a sysadmin
> can pick an hour or a day reset if they are facing an issue and are
> wondering "did this just happen once, or am I only being informed once?"

OK, I though more about it and I NACK this patchset.

My reason:

1. The primary purpose was to provide a way to reset warn_once() without
debugfs. From this POV, the solution is rather complicated: timers
and another kernel parameter.

2. I am not aware of any convincing argument why debugfs could not be
mounted on the debugged system.

3. Debugfs provides many more debugging facilities. It is designed for
this purpose. It does not look like a good strategy to provide
alternative interfaces just to avoid it.

4. There were mentioned several other use cases for this feature,
like RT systems. But it was not clear that it was really needed
or that people would really use it.

5. Some code might even rely on that it is called only once, see commit
dfbf2897d00499f94cd ("bug: set warn variable before calling
WARN()") or the recent
https://lore.kernel.org/r/20201029142406.3c46855a@xxxxxxxxxxxxxxxxxx

It should better stay as debugging feature that should be used with
care.


6. It creates system wide ratelimited printk().

We have printk_ratelimited() for this. And it is quite problematic.
It is supposed to prevent flood of printk() messages. But it does
not work well because the limits depend on more factors, like:
system size, conditions, console speed.

Yes, the proposed feature is supposed to solve another problem
(lack of messages). But it is a global action that would
re-enable >1000 messages that were limited to be printed
only once because they could be too frequent. As a result:

+ it might cause flood of printk() messages

+ it is hard to define a good system wide time limit;
it was even unclear what should be the lower limit.

+ it will restart the messages at some "random" point,
so that the relation of the reported events would
be unclear.

From the API point of view:

+ printk_ratelimited() is used when we want to see that a
problem is still there. It is per-message setting.

+ printk_once() is used when even printk_ratelimited() would
be too much. It is per-message setting.

+ The new printk_repeated_once() is a strange mix of this two
with the global setting. It does not fit much.


Best Regards,
Petr

PS: I did not answer your last mail because it looked like an endless
fight over words or point of views. I decided to make a summary
of my view instead. These are reason why I nacked it.

I know that there might be different views but so far no arguments
changed mine. And I do not know how to explain it better.