Re: [PATCH] firewire: core: send bus reset promptly on gap count error

From: Takashi Sakamoto
Date: Wed Jan 31 2024 - 09:27:29 EST


Hi,

I'm sorry to be late for reply.

On Sat, Jan 27, 2024 at 03:13:45AM -0800, Adam Goldman wrote:
> Hi,
>
> On Sat, Jan 27, 2024 at 05:37:30PM +0900, Takashi Sakamoto wrote:
> > On Tue, Jan 23, 2024 at 12:11:40AM -0800, Adam Goldman wrote:
> > > If we are bus manager and the bus has inconsistent gap counts, send a
> > > bus reset immediately instead of trying to read the root node's config
> > > ROM first. Otherwise, we could spend a lot of time trying to read the
> > > config ROM but never succeeding.
> > >
> > > This eliminates a 50+ second delay before the FireWire bus is usable
> > > after a newly connected device is powered on in certain circumstances.
> >
> > At first, would I request you to explain about the certain
> > circumstances in the patch comment? It is really helpful to understand
> > the change itself.
>
> The delay occurs if a gap count inconsistency occurs, we are not the
> root node, and we become bus manager. One scenario that causes this is
> with a TI XIO2213B OHCI, the first time a Sony DSR-25 is powered on
> after being connected to the FireWire cable.
>
> > > Link: https://sourceforge.net/p/linux1394/mailman/message/58727806/
>
> This link has a longer description and kernel logs.

Of course, I've already read your post in the last week. Yes, you did
enough investigation. However, in the review process, you need to post
patches with sufficient description.

> > > --- linux-source-6.1.orig/drivers/firewire/core-card.c 2023-09-23 02:11:13.000000000 -0700
> > > +++ linux-source-6.1/drivers/firewire/core-card.c 2024-01-22 04:23:06.000000000 -0800
> > > @@ -435,6 +435,16 @@
> > > * config rom. In either case, pick another root.
> > > */
> > > new_root_id = local_id;
> > > + } else if (card->gap_count == 0) {
> > > + /*
> > > + * If self IDs have inconsistent gap counts, do a
> > > + * bus reset ASAP. The config rom read might never
> > > + * complete, so don't wait for it. However, still
> > > + * send a PHY configuration packet prior to the bus
> > > + * reset, as permitted by IEEE 1394-2008 8.4.5.2.
> > > + */
> > > + new_root_id = local_id;
> > > + card->bm_retries = 0;
> > > } else if (!root_device_is_running) {
> > > /*
> > > * If we haven't probed this device yet, bail out now
> >
> > Next, after the condition branches, we can see below lines:
> >
> > ```
> > /*
> > * Finally, figure out if we should do a reset or not. If we have
> > * done less than 5 resets with the same physical topology and we
> > * have either a new root or a new gap count setting, let's do it.
> > */
> >
> > if (card->bm_retries++ < 5 &&
> > (card->gap_count != gap_count || new_root_id != root_id))
> > do_reset = true;
> > ```
> >
> > When the value of "card->gap_count" is zero, it would hit the condition of
> > "card->gap_count != gap_count". I think the transmission of phy config
> > packet and scheduling of short bus reset would be done, regardless of the
> > change. Would I ask the main intention to the additional branch?
>
> Without the additional branch, the !root_device_is_running branch will
> be taken (because the root node's config ROM hasn't been read yet), and
> bm_work will go to sleep. Eventually we will give up trying to read the
> config ROM, the root_device==NULL branch will be taken, and the bus
> reset will be done. The additional branch eliminates waiting for the
> config ROM read when gap_count is zero.
>
> Here is the full sequence of events:
>
> 1. Bus reset occurs due to newly active device.
>
> 2. Self identification process completes. We are not root node. Gap
> counts are inconsistent.
>
> 3. build_tree() notices the gap count error and sets gap_count to 0:
>
> > /*
> > * If PHYs report different gap counts, set an invalid count
> > * which will force a gap count reconfiguration and a reset.
> > */
> > if (SELF_ID_GAP_COUNT(q) != gap_count)
> > gap_count = 0;
>
> 4. bm_work() starts and makes us bus manager:
>
> > rcode = fw_run_transaction(card, TCODE_LOCK_COMPARE_SWAP,
> > irm_id, generation, SCODE_100,
> > CSR_REGISTER_BASE + CSR_BUS_MANAGER_ID,
> > transaction_data, 8);
>
> 5. read_config_rom() starts reading the root node's config ROM.
>
> 6. bm_work() wants to know if the root node is cycle master capable. If
> the root node is cycle master capable, we will leave it as the root
> node. Otherwise, we will make ourself the root node. To determine if the
> root node is cycle master capable, we must wait until its config ROM has
> been read:
>
> > } else if (!root_device_is_running) {
> > /*
> > * If we haven't probed this device yet, bail out now
> > * and let's try again once that's done.
> > */
> > spin_unlock_irq(&card->lock);
> > goto out;
>
> 7a. Without the patch: read_config_rom() reads the first few quadlets
> from the config ROM. Due to the gap count inconsistency, eventually one
> of the reads times out. When read_config_rom() fails, fw_device_init()
> calls it again until MAX_RETRIES is reached. This takes 50+ seconds.
>
> 8a. bm_work() sees that we have given up trying to read the config ROM.
> It makes us the root node and does a bus reset:
>
> > if (root_device == NULL) {
> > /*
> > * Either link_on is false, or we failed to read the
> > * config rom. In either case, pick another root.
> > */
> > new_root_id = local_id;
> > ...
> > if (card->bm_retries++ < 5 &&
> > (card->gap_count != gap_count || new_root_id != root_id))
> > do_reset = true;
>
> 7b. With the patch: Because of the gap count inconsistency, bm_work()
> does not wait for the config ROM to be read. It makes us the root node
> and does a bus reset immediately:
>
> > } else if (card->gap_count == 0) {
> > /*
> > * If self IDs have inconsistent gap counts, do a
> > * bus reset ASAP. The config rom read might never
> > * complete, so don't wait for it. However, still
> > * send a PHY configuration packet prior to the
> > * bus reset, as permitted by 1394-2008 8.4.5.2.
> > */
> > new_root_id = local_id;
> > card->bm_retries = 0;
> > ...
> > if (card->bm_retries++ < 5 &&
> > (card->gap_count != gap_count || new_root_id != root_id))
> > do_reset = true;

Thanks for the detail.

The mismatch of gap count can be detected in the firstworkqueue context
(bus_reset_work()) in 'driver/firewire/ohci.c', so I investigated more
quick way to reset the bus, however your change looks better than it.

I'd like to apply your patch for v6.9 kernel, while the patch
description is not suffice, as I address. I'm sorry to ask you more
work, but would I ask you to repost your patch with any detail
description? The description should include enough information for
developers inner/outer this subsystem to understand its background
and intention.


Thanks

Takashi Sakamoto