Re: [RFC][PATCH] netconsole: avoid deadlock on printk from driver code

From: Vegard Nossum
Date: Wed Aug 13 2008 - 06:22:13 EST


(Hm, did my original e-mail not make it to the lists?)

>> The least invasive fix is to detect that we're trying to re-enter the
>> driver code. We provide a netdev_busy() function which can be used to
>> determine whether a deadlock can occur if we try to transmit another
>> packet.
>>
>> Note that this may lead to lost messages if the driver is active on
>> another CPU while we try to use the same device for netconsole.
>
> This sucks.

Indeed. What can be done about it?

>> It would probably be best to set a "lost messages" flag in this case and
>> add it to the stream when the device becomes ready again.

With such a flag, we can at least avoid silent loss of messages. But
what is the likelihood of this happening in the first place? I have no
idea.

I still think that this is better than a possible deadlock, which
gives NO CLUE as to what happened (because it jams the other
consoles). Hopefully the interface will not be used for other traffic.

But I agree. This possibly penalizes all the other cases where people
*don't* pull out their netconsole-device cables.

We could simply not print out the "link up/down" message if the device
is running netconsole. But it feels like papering over the design
error. Because there are other functions in there which may cause
additional messages to be printed, not just the very visible printk().

Can we detect if the lock was taken on the same CPU or not?

Maybe it is possible to postpone the packet instead of discarding it?

>> +static bool rtl8139_busy (struct net_device *dev)
>> +{
>> + struct rtl8139_private *tp = netdev_priv(dev);
>> + return spin_is_locked(&tp->lock);
>> +}
>
> How do I know if my driver is suspectible to this sort of deadlock?

if the ->hard_start_xmit() takes any locks which can be taken from any
code that calls (also indirectly!) printk(), in this case the
interrupt handler that is invoked when I replace the cable. My guess
is that _all_ drivers need to take some sort of lock in
hard_start_xmit().

It's a bit fragile, I agree. But the completely silent deadlock is nasty too!


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/