Re: [PATCH] firewire: core: send bus reset promptly on gap count error

From: Adam Goldman
Date: Sat Jan 27 2024 - 06:14:04 EST


Hi,

On Sat, Jan 27, 2024 at 05:37:30PM +0900, Takashi Sakamoto wrote:
> On Tue, Jan 23, 2024 at 12:11:40AM -0800, Adam Goldman wrote:
> > If we are bus manager and the bus has inconsistent gap counts, send a
> > bus reset immediately instead of trying to read the root node's config
> > ROM first. Otherwise, we could spend a lot of time trying to read the
> > config ROM but never succeeding.
> >
> > This eliminates a 50+ second delay before the FireWire bus is usable
> > after a newly connected device is powered on in certain circumstances.
>
> At first, would I request you to explain about the certain
> circumstances in the patch comment? It is really helpful to understand
> the change itself.

The delay occurs if a gap count inconsistency occurs, we are not the
root node, and we become bus manager. One scenario that causes this is
with a TI XIO2213B OHCI, the first time a Sony DSR-25 is powered on
after being connected to the FireWire cable.

> > Link: https://sourceforge.net/p/linux1394/mailman/message/58727806/

This link has a longer description and kernel logs.


> > --- linux-source-6.1.orig/drivers/firewire/core-card.c 2023-09-23 02:11:13.000000000 -0700
> > +++ linux-source-6.1/drivers/firewire/core-card.c 2024-01-22 04:23:06.000000000 -0800
> > @@ -435,6 +435,16 @@
> > * config rom. In either case, pick another root.
> > */
> > new_root_id = local_id;
> > + } else if (card->gap_count == 0) {
> > + /*
> > + * If self IDs have inconsistent gap counts, do a
> > + * bus reset ASAP. The config rom read might never
> > + * complete, so don't wait for it. However, still
> > + * send a PHY configuration packet prior to the bus
> > + * reset, as permitted by IEEE 1394-2008 8.4.5.2.
> > + */
> > + new_root_id = local_id;
> > + card->bm_retries = 0;
> > } else if (!root_device_is_running) {
> > /*
> > * If we haven't probed this device yet, bail out now
>
> Next, after the condition branches, we can see below lines:
>
> ```
> /*
> * Finally, figure out if we should do a reset or not. If we have
> * done less than 5 resets with the same physical topology and we
> * have either a new root or a new gap count setting, let's do it.
> */
>
> if (card->bm_retries++ < 5 &&
> (card->gap_count != gap_count || new_root_id != root_id))
> do_reset = true;
> ```
>
> When the value of "card->gap_count" is zero, it would hit the condition of
> "card->gap_count != gap_count". I think the transmission of phy config
> packet and scheduling of short bus reset would be done, regardless of the
> change. Would I ask the main intention to the additional branch?

Without the additional branch, the !root_device_is_running branch will
be taken (because the root node's config ROM hasn't been read yet), and
bm_work will go to sleep. Eventually we will give up trying to read the
config ROM, the root_device==NULL branch will be taken, and the bus
reset will be done. The additional branch eliminates waiting for the
config ROM read when gap_count is zero.

Here is the full sequence of events:

1. Bus reset occurs due to newly active device.

2. Self identification process completes. We are not root node. Gap
counts are inconsistent.

3. build_tree() notices the gap count error and sets gap_count to 0:

> /*
> * If PHYs report different gap counts, set an invalid count
> * which will force a gap count reconfiguration and a reset.
> */
> if (SELF_ID_GAP_COUNT(q) != gap_count)
> gap_count = 0;

4. bm_work() starts and makes us bus manager:

> rcode = fw_run_transaction(card, TCODE_LOCK_COMPARE_SWAP,
> irm_id, generation, SCODE_100,
> CSR_REGISTER_BASE + CSR_BUS_MANAGER_ID,
> transaction_data, 8);

5. read_config_rom() starts reading the root node's config ROM.

6. bm_work() wants to know if the root node is cycle master capable. If
the root node is cycle master capable, we will leave it as the root
node. Otherwise, we will make ourself the root node. To determine if the
root node is cycle master capable, we must wait until its config ROM has
been read:

> } else if (!root_device_is_running) {
> /*
> * If we haven't probed this device yet, bail out now
> * and let's try again once that's done.
> */
> spin_unlock_irq(&card->lock);
> goto out;

7a. Without the patch: read_config_rom() reads the first few quadlets
from the config ROM. Due to the gap count inconsistency, eventually one
of the reads times out. When read_config_rom() fails, fw_device_init()
calls it again until MAX_RETRIES is reached. This takes 50+ seconds.

8a. bm_work() sees that we have given up trying to read the config ROM.
It makes us the root node and does a bus reset:

> if (root_device == NULL) {
> /*
> * Either link_on is false, or we failed to read the
> * config rom. In either case, pick another root.
> */
> new_root_id = local_id;
> ...
> if (card->bm_retries++ < 5 &&
> (card->gap_count != gap_count || new_root_id != root_id))
> do_reset = true;

7b. With the patch: Because of the gap count inconsistency, bm_work()
does not wait for the config ROM to be read. It makes us the root node
and does a bus reset immediately:

> } else if (card->gap_count == 0) {
> /*
> * If self IDs have inconsistent gap counts, do a
> * bus reset ASAP. The config rom read might never
> * complete, so don't wait for it. However, still
> * send a PHY configuration packet prior to the
> * bus reset, as permitted by 1394-2008 8.4.5.2.
> */
> new_root_id = local_id;
> card->bm_retries = 0;
> ...
> if (card->bm_retries++ < 5 &&
> (card->gap_count != gap_count || new_root_id != root_id))
> do_reset = true;

-- Adam