Re: [Issue report] drivers/ftgmac100: DHCP occasionally fails during boot up or link down/up

From: Heyi Guo
Date: Sat Feb 19 2022 - 05:08:45 EST


Hi Andrew,

The DHCP issue is gone after applying below patch. I put the lock statements outside of the pure reset function, for the phydev lock has been acquired before calling adjust_link. The lock order in ftgmac100_reset_task() was also changed, to make it the same as the lock procedure in adjust_link, in which the phydev is locked first and then rtnl_lock. I'm not quite sure whether it will bring in any potential dead lock. Any advice?

Thanks,

Heyi

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index 1c7912a94e36d..9610b59ca0876 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1002,6 +1002,8 @@ static int ftgmac100_alloc_rx_buffers(struct ftgmac100 *priv)
        return 0;
 }

+static void ftgmac100_reset(struct ftgmac100 *priv);
+
 static void ftgmac100_adjust_link(struct net_device *netdev)
 {
        struct ftgmac100 *priv = netdev_priv(netdev);
@@ -1050,8 +1052,14 @@ static void ftgmac100_adjust_link(struct net_device *netdev)
        /* Disable all interrupts */
        iowrite32(0, priv->base + FTGMAC100_OFFSET_IER);

-       /* Reset the adapter asynchronously */
-       schedule_work(&priv->reset_task);
+       if (priv->mii_bus)
+               mutex_lock(&priv->mii_bus->mdio_lock);
+       /* Lock the world */
+       rtnl_lock();
+       ftgmac100_reset(priv);
+       rtnl_unlock();
+       if (priv->mii_bus)
+               mutex_unlock(&priv->mii_bus->mdio_lock);
 }

 static int ftgmac100_mii_probe(struct ftgmac100 *priv, phy_interface_t intf)
@@ -1388,26 +1396,17 @@ static int ftgmac100_init_all(struct ftgmac100 *priv, bool ignore_alloc_err)
        return err;
 }

-static void ftgmac100_reset_task(struct work_struct *work)
+static void ftgmac100_reset(struct ftgmac100 *priv)
 {
-       struct ftgmac100 *priv = container_of(work, struct ftgmac100,
-                                             reset_task);
        struct net_device *netdev = priv->netdev;
        int err;

        netdev_dbg(netdev, "Resetting NIC...\n");

-       /* Lock the world */
-       rtnl_lock();
-       if (netdev->phydev)
-               mutex_lock(&netdev->phydev->lock);
-       if (priv->mii_bus)
-               mutex_lock(&priv->mii_bus->mdio_lock);
-

        /* Check if the interface is still up */
        if (!netif_running(netdev))
-               goto bail;
+               return;

        /* Stop the network stack */
        netif_trans_update(netdev);
@@ -1429,12 +1428,29 @@ static void ftgmac100_reset_task(struct work_struct *work)
        ftgmac100_init_all(priv, true);

        netdev_dbg(netdev, "Reset done !\n");
- bail:
+}
+
+static void ftgmac100_reset_task(struct work_struct *work)
+{
+       struct ftgmac100 *priv = container_of(work, struct ftgmac100,
+                                             reset_task);
+       struct net_device *netdev = priv->netdev;
+
+       int err;
+       /* Lock the world */
+       if (netdev->phydev)
+               mutex_lock(&netdev->phydev->lock);
+       if (priv->mii_bus)
+               mutex_lock(&priv->mii_bus->mdio_lock);
+       rtnl_lock();
+
+       ftgmac100_reset(priv);
+
+       rtnl_unlock();
        if (priv->mii_bus)
                mutex_unlock(&priv->mii_bus->mdio_lock);
        if (netdev->phydev)
                mutex_unlock(&netdev->phydev->lock);
-       rtnl_unlock();
 }

 static int ftgmac100_open(struct net_device *netdev)

在 2022/2/16 上午4:50, Andrew Lunn 写道:
On Tue, Feb 15, 2022 at 02:38:51PM +0800, Heyi Guo wrote:
Hi,

We are using Aspeed 2600 and found DHCP occasionally fails during boot up or
link down/up. The DHCP client is systemd 247.6 networkd. Our network device
is 2600 MAC4 connected to a RGMII PHY module.

Current investigation shows the first DHCP discovery packet sent by
systemd-networkd might be corrupted, and sysmtemd-networkd will continue to
send DHCP discovery packets with the same XID, but no other packets, as
there is no IP obtained at the moment. However the server side will not
respond with this serial of DHCP requests, until it receives some other
packets. This situation can be recovered by another link down/up, or a "ping
-I eth0 xxx.xxx.xxx.xxx" command to insert some other TX packets.

Navigating the driver code ftgmac.c, I've some question about the work flow
from link down to link up. I think the flow is as below:

1. ftgmac100_open() will enable net interface with ftgmac100_init_all(), and
then call phy_start()

2. When PHY is link up, it will call netif_carrier_on() and then adjust_link
interface, which is ftgmac100_adjust_link() for ftgmac100
The order there is questionable. Maybe it should first call the adjust
link callback, and then the netif_carrier_on(). However...

3. In ftgmac100_adjust_link(), it will schedule the reset work
(ftgmac100_reset_task)

4. ftgmac100_reset_task() will then reset the MAC
Because of this delayed reset, changing the order will not help this
driver.

I found networkd will start to send DHCP request immediately after
netif_carrier_on() called in step 2, but step 4 will reset the MAC, which
may potentially corrupt the sending packet.
What is not clear to my is why it is scheduling the work rather than
just doing it. At least for adjust_link, it is in a context it can
sleep. ftgmac100_set_ringparam() should also be able to
sleep. ftgmac100_interrupt() cannot sleep, so it does need to schedule
work.

I would suggest you refactor ftgmac100_reset_task() into a function
that actually does the reset, and a wrapper which takes a
work_struct. adjust_link can then directly do the reset, which
probably solves your problem.

Andrew