On Tue, 5 Dec 2006 17:48:05 +0000 (GMT)
"Maciej W. Rozycki" <macro@xxxxxxxxxxxxxx> wrote:
Essentially there is a race when disconnecting from a PHY, because
interrupt delivery uses the event queue for processing. The function to
handle interrupts that is called from the event queue is phy_change ().
It takes a pointer to a structure that is associated with the PHY. At the
time phy_stop_interrupts() is called there may be one or more calls to
phy_change() still pending on the event queue. They may not be able to be
processed until the structure passed to phy_change() have been freed, at
which point calling the function is wrong.
One way of avoiding it is calling flush_scheduled_work() from
phy_stop_interrupts(). This is fine as long as a caller of
phy_stop_interrupts() (not necessarily the immediate one calling into
libphy) does not hold the netlink lock.
So let me try to rephrase...
- phy_change() is the workqueue callback function. It is executed by
keventd.
- Something under phy_change() takes rtnl_lock() (but what??)
- phy_stop_interrupts() does flush_scheduled_work(). This has to
following logic:
- if I am kevetnd, run phy_change() directly.
- If I am not keventd, wait for keventd() to run phy_change()
- So if the caller of phy_stop_interrupt() already holds rtnl_lock(),
and if that caller is keventd then it will recur onto rntl_lock() and
will deadlock.
Problem is, if the caller of phy_stop_interrupt() is *not* keventd, that
caller will still deadlock, because that caller is waiting for keventd to
run phy_change(), and keventd cannot do that, because the not-keventd
process already holds rtnl_lock.
Now, afaict, there are only two callers of phy_stop_interrupts(): the
close() handlers of gianfar.c and fs_enet-main.c (confusingly held in
netdevice.stop (confusingly called by dev_close())). Via phy_disconnect.
Did I miss anything?
And the dev_close() caller holds rtnl_lock.