Re: warn: Turn the netdev timeout WARN_ON() into a WARN()

From: Arjan van de Ven
Date: Wed Sep 17 2008 - 09:41:46 EST


On Tue, 16 Sep 2008 23:27:08 -0400
Jeff Garzik <jeff@xxxxxxxxxx> wrote:

> On Wed, Sep 17, 2008 at 02:59:12AM +0000, Linux Kernel Mailing List
> wrote:
>
> hrm, am I misunderstanding?
>
> AFAICS, this change means the user is no longer notified [after
> the first time] of a condition they really need to know about --
> a hardware or driver bug.
>
> These conditions can occur many hours or days apart, and the admin
> needs to know EACH time it occurs, because it is a major networking
> event, generally leading to a complete reset of the entire hardware.

ok fair enough; the patch below drops the _ONCE

>
> And quite honestly, the backtrace is not useful (yes, even the one
> that existing previously)... THINK for a second. The backtrace
> is going to look exactly the same, since it is a timer-triggered
> dev_watchdog() call.
>
> NETDEV WATCHDOG timeouts are not easily fixable errors like lockdep
> warnings, and the admin really does need to see each one.
>
> Unless I am missing something, (1) this patch should be reverted,

.. or rather changed as below ..

> and in additional, (2) I recommend removing the WARN_ON_ONCE()
> because the backtrace is not helpful.

I disagree on your (2):
First of all, you say that these messages are really critical for the
admin; well, a WARN() is a much more critical message than a printk, and
tends to get a lot more attention to it.
Second, a WARN() is about giving a set of standard information so that
the bugreports we get are more useful the first time. Yes a backtrace
is part of that, and yes you're right just the backtrace won't have new
information here, but a WARN() message also contains things such as the
kernel version, the loaded modules and is generally easier for a user
to make a good bugreport from. That one component of it isn't going to
contain new information so be it.
Third, by using this standardized format, computer programs such as
the kerneloops tools can automatically collect these reports and do
statistics on them.

Your argument that these are not easily fixable may be valid, but
they're as you say serious issues, and at least some portion can be
fixed by doing driver work; I just got an email from an ATL1E developer
as a result of my oopsreport who said he was fixing such issues the
last week for example. If all the networking guys agree that the report
has no value for developers because they're all unfixable hardware bugs,
I'm happy to stop tracking it in kerneloops.org, but I get the impression
so far that that is not the case.




From: Arjan van de Ven <arjan@xxxxxxxxxxxxxxx>
Subject: [PATCH] net: WARN_ONCE -> WARN as requested by Jeff Garzik

Jeff points out that you want to see each and every
timeout message

Signed-off-by: Arjan van de Ven <arjan@xxxxxxxxxxxxxxx>
---
net/sched/sch_generic.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ec0a083..8772642 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -215,7 +215,7 @@ static void dev_watchdog(unsigned long arg)
time_after(jiffies, (dev->trans_start +
dev->watchdog_timeo))) {
char drivername[64];
- WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit timed out\n",
+ WARN(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit timed out\n",
dev->name, netdev_drivername(dev, drivername, 64));
dev->tx_timeout(dev);
}
--
1.5.5.1



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/