Re: [PATCHv3 2/2] mtd: m25p80: restore the status of SPI flash when exiting

From: NeilBrown
Date: Mon Jul 23 2018 - 21:52:09 EST


On Tue, Jul 24 2018, Boris Brezillon wrote:

> On Tue, 24 Jul 2018 08:46:33 +1000
> NeilBrown <neilb@xxxxxxxx> wrote:
>
>> On Mon, Jul 23 2018, Brian Norris wrote:
>>
>> > Hi Boris,
>> >
>> > On Mon, Jul 23, 2018 at 1:10 PM, Boris Brezillon
>> > <boris.brezillon@xxxxxxxxxxx> wrote:
>> >> On Mon, 23 Jul 2018 11:13:50 -0700
>> >> Brian Norris <computersforpeace@xxxxxxxxx> wrote:
>> >>> I noticed this got merged, but I wanted to put my 2 cents in here:
>> >>
>> >> I wish you had replied to this thread when it was posted (more than
>> >> 6 months ago). Reverting the patch now implies making some people
>> >> unhappy because they'll have to resort to their old out-of-tree
>> >> hacks :-(.
>> >
>> > I'd say I'm sorry for not following things closely these days, but I'm
>> > not really that sorry. There are plenty of other capable hands. And if
>> > y'all shoot yourselves in the foot, so be it. This patch isn't going
>> > to blow things up, but now that I did finally notice it (because it
>> > happened to show up in a list of backports I was looking at), I
>> > thought better late than never to remind you.
>> >
>> > For way of notification: Marek already noticed that we've started down
>> > a slippery slope months ago:
>> >
>> > https://lkml.org/lkml/2018/4/8/141
>> > Re: [PATCH] mtd: spi-nor: clear Extended Address Reg on switch to
>> > 3-byte addressing.
>> >
>> > I'm not quite sure why that wasn't taken to its logical conclusion --
>> > that the hack should be reverted.
>> >
>> > This problem has been noted many times already, and we've always
>> > stayed on the side of *avoiding* this hack. A few references from a
>> > search of my email:
>> >
>> > http://lists.infradead.org/pipermail/linux-mtd/2013-March/046343.html
>> > [PATCH 1/3] mtd: m25p80: utilize dedicated 4-byte addressing commands
>> >
>> > http://lists.infradead.org/pipermail/barebox/2014-September/020682.html
>> > [RFC] MTD m25p80 3-byte addressing and boot problem
>> >
>> > http://lists.infradead.org/pipermail/linux-mtd/2015-February/057683.html
>> > [PATCH 2/2] m25p80: if supported put chip to deep power down if not used
>> >
>> >>> On Wed, Dec 06, 2017 at 10:53:42AM +0800, Zhiqiang Hou wrote:
>> >>> > From: Hou Zhiqiang <Zhiqiang.Hou@xxxxxxx>
>> >>> >
>> >>> > Restore the status to be compatible with legacy devices.
>> >>> > Take Freescale eSPI boot for example, it copies (in 3 Byte
>> >>> > addressing mode) the RCW and bootloader images from SPI flash
>> >>> > without firing a reset signal previously, so the reboot command
>> >>> > will fail without reseting the addressing mode of SPI flash.
>> >>> > This patch implement .shutdown function to restore the status
>> >>> > in reboot process, and add the same operation to the .remove
>> >>> > function.
>> >>>
>> >>> We have previously rejected this patch multiple times, because the above
>> >>> comment demonstrates a broken product.
>> >>
>> >> If we were to only support working HW parts, I fear Linux would not
>> >> support a lot of HW (that's even more true when it comes to flashes :P).
>> >
>> > You stopped allowing UBI to attach to MLC NAND recently, no? That
>> > sounds like almost the same boat -- you've probably killed quite a few
>> > shitty products, if they were to use mainline directly.
>> >
>> > Anyway, that's derailing the issue. Supporting broken hardware isn't
>> > something you try to do by applying the same hack to all systems. You
>> > normally try to apply your hack as narrowly as possible. You seem to
>> > imply that below. So maybe that's a solution to move forward with. But
>> > I'd personally be just as happy to see the patch reverted.
>> >
>> >>> You cannot guarantee that all
>> >>> reboots will invoke the .shutdown() method -- what about crashes? What
>> >>> about watchdog resets? IIUC, those will hit the same broken behavior,
>> >>> and have unexepcted behavior in your bootloader.
>> >>
>> >> Yes, there are corner cases that are not addressed with this approach,
>> >
>> > Is a system crash really a corner case? :D
>> >
>> >> but it still seems to improve things. Of course, that means the
>> >> user should try to re-route all HW reset sources to SW ones (RESET input
>> >> pin muxed to the GPIO controller, watchdog generating an interrupt
>> >> instead of directly asserting the RESET output pin), which is not always
>> >> possible, but even when it's not, isn't it better to have a setup that
>> >> works fine 99% of the time instead of 50% of the time?
>> >
>> > Perhaps, but not at the expense of future development. And
>> > realistically, no one is doing that if they have this hack. Most
>> > people won't even know that this hack is protecting them at all (so
>> > again, they won't try to mitigate the problem any further).
>> >
>> >>> I suppose one could argue for doing this in remove(), but AIUI you're
>> >>> just papering over system bugs by introducing the shutdown() function
>> >>> here. Thus, I'd prefer we drop the shutdown() method to avoid misleading
>> >>> other users of this driver.
>> >>
>> >> I understand your point. But if the problem is about making sure people
>> >> designing new boards get that right, why not complaining at probe time
>> >> when things are wrong?
>> >>
>> >> I mean, spi_nor_restore() seems to only do something on very specific
>> >> NORs (those on which a SW RESET does not resets the addressing
>> >> mode).
>> >
>> > The point isn't that SW RESET doesn't reset the addressing mode -- it
>> > does on any flash I've seen. The point is that most systems are built
>> > around a stateless assumption in these flash. IIRC, there wasn't even
>> > a SW RESET command at all until these "huge" flash came around and
>> > stateful addressing modes came about. So boot ROMs and bootloaders
>> > would have to be updated to start figuring out when/how to do this SW
>> > RESET. And once two vendors start doing it differently (I'm not sure:
>> > have they done this already? I think so) it's no longer something a
>> > boot ROM will get right.
>> >
>> > The only way to get this stuff right is to have a hardware reset, or
>> > else to avoid all of the stateful modes in software.
>> >
>> >> So, how about adding a flag that says "my board has the NOR HW
>> >> RESET pin wired" (there would be a DT props to set that flag). Then you
>> >> add a WARN_ON() when this flag is not set and a NOR chip impacted by
>> >> this bug is detected.
>> >
>> > I'd kinda prefer the reverse. There really isn't a need to document
>> > anything for a working system (software usually can't control this
>> > RESET pin). The burden should be on the b0rked system to document
>> > where it needs unsound hacks to survive.
>> >
>> >> This way you make sure people are informed that
>> >> they're doing something wrong, and for those who can't change their HW
>> >> (because it's already widely deployed), you have a fix that improve
>> >> things.
>> >
>> > Or even better: put this hack behind a DT flag, so that one has to
>> > admit that their board design is broken before it will even do
>> > anything. Proposal: "linux,badly-designed-flash-reset".
>> >
>> > But, I'd prefer just (partially?) reverting this, and let the authors
>> > submit something that works. We're not obligated to keep bad hacks in
>> > the kernel.
>> >
>> > Brian
>>
>> One possibility that occurred to me when I was exploring this issue is
>> to revert to 3-byte mode whenever 4-byte was not actively in use.
>> So any access beyond 16Meg is:
>> switch-to-4-byte ; perform IO ; switch to 3-byte
>> or similar. On my hardware it would be more efficient to
>> use the 4-byte opcode to perform the IO, then reset the cached
>> 4th address byte that the NOR chip transparently remembered.
>>
>> This adds a little overhead, but should be fairly robust.
>> It doesn't help if something goes terribly wrong while IO is happening,
>> but I don't think any other software solution does either.
>>
>> How would you see that approach?
>
> I think the problem stands: people that have proper HW mitigation for
> this problem (NOR chip is reset when the Processor is reset) don't want
> to pay the overhead. So, even if we go for this approach, we probably
> want to only do that for broken HW.

I agree that a "my-hardware-is-suboptimal" flag is appropriate.
I was addressing the suggestion that the current approach doesn't handle
all corner cases and was suggesting a different approach that might
handle more corner-cases. I should have been more explicit about that.

Thanks,
NeilBrown

Attachment: signature.asc
Description: PGP signature