Re: [PATCH v2 3/7] [PATCH 3/8] can: CAN Network device driver andNetlink interface

From: Oliver Hartkopp
Date: Fri May 15 2009 - 03:15:38 EST


Wolfgang Grandegger wrote:
Oliver Hartkopp wrote:
Wolfgang Grandegger wrote:
Oliver Hartkopp wrote:
Andrew Morton wrote:
On Tue, 12 May 2009 11:28:00 +0200 Wolfgang Grandegger
<wg@xxxxxxxxxxxxxx> wrote:

+int can_restart_now(struct net_device *dev)
+{
+ struct can_priv *priv = netdev_priv(dev);
+ struct net_device_stats *stats = &dev->stats;
+ struct sk_buff *skb;
+ struct can_frame *cf;
+ int err;
+
+ /* Synchronize with dev->hard_start_xmit() */
+ netif_tx_lock(dev);
+
+ /* Ensure that no more messages can go out */
+ if (netif_carrier_ok(dev))
+ netif_carrier_off(dev);
+
+ /* Ensure that no more messages can come in */
+ err = priv->do_set_mode(dev, CAN_MODE_STOP);
+ if (err)
+ return err;
+
+ /* Now it's save to clean up */
+ del_timer_sync(&priv->restart_timer);
This is deadlockable.

It calls del_timer_sync() while holding netif_tx_lock(). But the timer
handler (can_restart_now()) also takes netif_tx_lock(). So if the
timer handler is presently running, it's sitting there spinning in
netif_tx_lock(). And del_timer_sync() is sitting there waiting for the
timer handler to complete.


Hi Wolfgang,

would it be an appropriate solution, just to invoke

netif_stop_queue() in can_bus_off()

and invoke

netif_wake_queue() in can_restart_now()

???

In a BUSOFF condition we're not able to send CAN frames anyway, so we
can disable the device queue and the we won't need any netif_tx_lock()
right?

AFAIK this was the original implementation before some of the latest
improvement with the netif_carrier stuff.

What do you think?
The problem is the "manual" restart triggered via the netlink interface,
which can occur in the middle of ndo_start_xmit().

Ah, i see.

What if the manual restart via netlink would also stop the queue and
start the timer?

It will not help if the restart is triggered in the middle of
ndo_start_xmit().


Hi Wolfgang,

i think, i found a solution that removes the locking problem completely:

When a bus-off occurs in the controller, the communication on the CAN bus can be treated as unusable for this controller (let's say it is dead).
E.g. the SJA1000 set's its reset bit for that reason and waits to be initialized by the CPU again.

So IMO restarting the CAN controller while in operational state is not a valid use case.

When a bus-off (interrupt) occurs, we should

- invoke netif_carrier_off(dev)
- invoke netif_stop_queue(dev)
- set the state to CAN_STATE_BUS_OFF

and of course create the error message, clear the interrupts(!) and then leave the irq service function.

That's it.

When the automatic CAN controller restart is enabled: start the timer.

For the manual netlink function: Test for CAN_STATE_BUS_OFF (!) and invoke the current can_restart_now(dev) or start the timer with e.g. 1ms ...

This approach should make it and fulfills the bus-off intention of the CAN controllers ("disabled and wait for re-initialisation").

And there's no locking of the tx_queue needed anymore as the tx_queue is already stopped, when the restart is performed.

What do you think about this approach?

Regards,
Oliver

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/