Re: [PATCH v2 5/5] r8152: Block future register access if register access fails

From: Doug Anderson
Date: Wed Oct 04 2023 - 16:12:39 EST


Hi,

On Wed, Oct 4, 2023 at 12:27 PM Douglas Anderson <dianders@xxxxxxxxxxxx> wrote:
>
> Even though the functions to read/write registers can fail, most of
> the places in the r8152 driver that read/write register values don't
> check error codes. The lack of error code checking is problematic in
> at least two ways.
>
> The first problem is that the r8152 driver often uses code patterns
> similar to this:
> x = read_register()
> x = x | SOME_BIT;
> write_register(x);
>
> ...with the above pattern, if the read_register() fails and returns
> garbage then we'll end up trying to write modified garbage back to the
> Realtek adapter. If the write_register() succeeds that's bad. Note
> that as of commit f53a7ad18959 ("r8152: Set memory to all 0xFFs on
> failed reg reads") the "garbage" returned by read_register() will at
> least be consistent garbage, but it is still garbage.
>
> It turns out that this problem is very serious. Writing garbage to
> some of the hardware registers on the Ethernet adapter can put the
> adapter in such a bad state that it needs to be power cycled (fully
> unplugged and plugged in again) before it can enumerate again.
>
> The second problem is that the r8152 driver generally has functions
> that are long sequences of register writes. Assuming everything will
> be OK if a random register write fails in the middle isn't a great
> assumption.
>
> One might wonder if the above two problems are real. You could ask if
> we would really have a successful write after a failed read. It turns
> out that the answer appears to be "yes, this can happen". In fact,
> we've seen at least two distinct failure modes where this happens.
>
> On a sc7180-trogdor Chromebook if you drop into kdb for a while and
> then resume, you can see:
> 1. We get a "Tx timeout"
> 2. The "Tx timeout" queues up a USB reset.
> 3. In rtl8152_pre_reset() we try to reinit the hardware.
> 4. The first several (2-9) register accesses fail with a timeout, then
> things recover.
>
> The above test case was actually fixed by the patch ("r8152: Increase
> USB control msg timeout to 5000ms as per spec") but at least shows
> that we really can see successful calls after failed ones.
>
> On a different (AMD) based Chromebook with a particular adapter, we
> found that during reboot tests we'd also sometimes get a transitory
> failure. In this case we saw -EPIPE being returned sometimes. Retrying
> worked, but retrying is not always safe for all register accesses
> since reading/writing some registers might have side effects (like
> registers that clear on read).
>
> Let's fully lock out all register access if a register access fails.
> When we do this, we'll try to queue up a USB reset and try to unlock
> register access after the reset. This is slightly tricker than it
> sounds since the r8152 driver has an optimized reset sequence that
> only works reliably after probe happens. In order to handle this, we
> avoid the optimized reset if probe didn't finish.
>
> When locking out access, we'll use the existing infrastructure that
> the driver was using when it detected we were unplugged. This keeps us
> from getting stuck in delay loops in some parts of the driver.
>
> - Reset patch no longer based on retry patch, since that was dropped.
> - Reset patch should be robust even if failures happen in probe.
> - Switched booleans to bits in the "flags" variable.
> - Check for -ENODEV instead of "udev->state == USB_STATE_NOTATTACHED"

Argh, the above 4 bullet points were supposed to get moved "after the
cut" and describe the differences between v1 and v2. Sorry! :(

To avoid spamming people, I won't send another version and will wait
for feedback. I'm happy to fix and send a new version at any time,
though.