Re: enhance ONFI table reliability/stable

From: Boris Brezillon
Date: Sat Nov 21 2015 - 02:46:19 EST


On Fri, 20 Nov 2015 15:59:27 -0800
Brian Norris <computersforpeace@xxxxxxxxx> wrote:

> On Thu, Nov 19, 2015 at 04:21:01AM +0000, Bean Huo 霍斌斌 (beanhuo) wrote:
> > > On Tue, Jul 21, 2015 at 02:42:34PM +0000, Bean Huo 霍斌斌 (beanhuo)
> > > wrote:
> > > > Hi,
> > > >
> > > > Recently I ran into some cases concerning ONFI parameter table
> > > > reliability; currently it is protected by a CRC. If there are bit
> > > > flips in an ONFI parameter page, a backup parameter page is used
> > > > instead. The latest Linux reads three copies by default.
> > > >
> > > > chip->cmdfunc(mtd, NAND_CMD_PARAM, 0, -1);
> > > > for (i = 0; i < 3; i++) {
> > > >         for (j = 0; j < sizeof(*p); j++)
> > > >                 ((uint8_t *)p)[j] = chip->read_byte(mtd);
> > > >         if (onfi_crc16(ONFI_CRC_BASE, (uint8_t *)p, 254) ==
> > > >             le16_to_cpu(p->crc)) {
> > > >                 break;
> > > >         }
> > > > }
> > > >
> > > > However, as technology improves, for TLC and new-generation MLC, I
> > > > think three copies of
> > >
> > > Ha, "improvement" :)
> > >
> > > > the parameter table are not robust enough. My question is whether
> > > > there is a good method to protect and correct the parameter page. For
> > > > example, we could use the Linux software BCH ECC. Any suggestions and
> > > > input are welcome; if you have any concerns about this, feel free to
> > > > tell me.
> > >
> > > I recall this being brought up at my old job, and all I can say is...
> > > (please pardon my censored language)
> >
> >
> > Yes, you told me about this before. I am just following up.
> > Sorry for the rude follow-up.
> > I only want to share one suggestion: using software ECC to protect the
> > ONFI table that is read from NAND. I would like to hear every MTD
> > expert's valuable feedback on this. If it is OK, I can do it.
>
> Perhaps I'm misunderstanding you, but I don't understand how you could
> possibly "do it" if it is a circular dependency. You have nowhere to
> store ECC/parity data for a parameter page, because you can't actually
> read/write the NAND flash until after you know its geometry.

Well, while I agree with most of your answer (why the hell are NAND
vendors storing the ONFI parameter page, and other sensitive information
in normal NAND pages, especially when we're talking about TLC/MLC
NANDs???), it's perfectly possible to have ECC in this case, as long as
the geometry is known in advance (at least this is true for BCH).

Say you have only 3 copies of the parameter page, with the ECC bytes
stored right after them. You can define the following layout:

|3 x parameter page size|3 x ECC bytes|

Of course this implies reserving the space after the 3 parameter pages
for the ECC bytes, which the current ONFI spec does not do (you must
have at least 3 copies, but you can have more).

And we would choose the ECC geometry with this logic:

ECC chunk size = sizeof(struct nand_onfi_params)
ECC strength = iteratively tested with different pre-defined values

This being said, I don't know how you would change the ONFI spec and
keep it compatible with the previous version. As I said, the current
version of the spec does not reserve any area after the mandatory
parameter pages...
You'll probably have to add a NAND_CMD_ALT_PARAM to support this kind
of thing.
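
To make the layout above a bit more concrete, here is a minimal sketch
of how the offsets and the strength probing could be computed. Nothing
in it comes from the ONFI spec or from existing kernel code: the
candidate strengths, the Galois field order (BCH_M), the names
param_ecc_layout, param_ecc_layout_init() and probe_param_ecc(), and the
decode callback are all assumptions made for illustration. The only
borrowed facts are the 256-byte parameter page and the usual BCH bound
of at most m * t parity bits per chunk.

```c
#include <assert.h>
#include <stddef.h>

#define ONFI_PARAM_SIZE   256 /* size of one ONFI parameter page copy */
#define ONFI_PARAM_COPIES 3   /* only the 3 mandatory copies are assumed */
#define BCH_M             12  /* assumed field order: 2^12 - 1 = 4095 bits
                               * covers 2048 data bits plus parity */

/* Hypothetical pre-defined strengths, probed strongest first. */
const unsigned int candidate_strengths[] = { 16, 8, 4 };
#define NUM_CANDIDATES \
	(sizeof(candidate_strengths) / sizeof(candidate_strengths[0]))

struct param_ecc_layout {
	unsigned int strength;                  /* correctable bits per chunk */
	size_t ecc_bytes;                       /* parity bytes per chunk */
	size_t data_offset[ONFI_PARAM_COPIES];  /* start of each page copy */
	size_t ecc_offset[ONFI_PARAM_COPIES];   /* start of each parity block */
	size_t total;                           /* bytes to read after PARAM */
};

/* A BCH code over GF(2^m) correcting t bits needs at most m * t parity bits. */
size_t bch_ecc_bytes(unsigned int m, unsigned int t)
{
	return ((size_t)m * t + 7) / 8;
}

/* Fill in |3 x parameter page|3 x ECC bytes| for a given strength. */
void param_ecc_layout_init(struct param_ecc_layout *l, unsigned int t)
{
	size_t data_len = (size_t)ONFI_PARAM_COPIES * ONFI_PARAM_SIZE;
	int i;

	assert(t > 0);
	l->strength = t;
	l->ecc_bytes = bch_ecc_bytes(BCH_M, t);
	for (i = 0; i < ONFI_PARAM_COPIES; i++) {
		l->data_offset[i] = (size_t)i * ONFI_PARAM_SIZE;
		l->ecc_offset[i] = data_len + (size_t)i * l->ecc_bytes;
	}
	l->total = data_len + ONFI_PARAM_COPIES * l->ecc_bytes;
}

/*
 * Hypothetical decoder hook: returns 0 if at least one copy could be
 * corrected and validated (e.g. by its CRC) with this layout. It is
 * expected to leave buf untouched on failure.
 */
typedef int (*param_decode_fn)(const struct param_ecc_layout *l,
			       unsigned char *buf);

/* Try each pre-defined strength until one decodes successfully. */
int probe_param_ecc(unsigned char *buf, param_decode_fn decode,
		    struct param_ecc_layout *out)
{
	size_t i;

	for (i = 0; i < NUM_CANDIDATES; i++) {
		param_ecc_layout_init(out, candidate_strengths[i]);
		if (decode(out, buf) == 0)
			return 0;
	}
	return -1;
}
```

With these assumed values, a strength of 8 costs 12 parity bytes per
copy, so the first parity block would start at byte 768 and the whole
area would be 804 bytes long.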

>
> > > ...that is complete and utter bulls***. An ONFI standard that can't guarantee
> > > "reliable enough" parameter pages is no standard at all.
> > >
> > > To step back a bit: How would one expect to store and retrieve ECC parity
> > > data? ...on the NAND flash? But to do that, we have to know the geometry
> > > parameters of said NAND flash. How do we figure out the geometry? From the
> > > ONFI parameter pages! Nice Catch 22 you have there.
>
> I realize a non-native English speaker might not understand the "Catch
> 22" reference. Wikipedia has a nice summary:
>
> https://en.wikipedia.org/wiki/Catch-22_(logic)
>
> Essentially, it's a circular argument, or a contradiction. An
> impossibility.
>
> > > Please encourage your employer never to produce "ONFI-compliant" flash that
> > > is this bad.
>
> I still stand by the above statement.
>
> But now that I'm in a slightly more charitable mood, there are ways to
> improve our ability to recover from slightly corrupted parameter pages
> (ECC is not one of them).
>
> For one, you could do some kind of bit majority. e.g.:
>
> (1) try pages 1-3
> (2) if none pass the CRC check, then compute bit majority of all 3; if
> the CRC of this combined page passes, then use it
> (3) ???

Should work too, but it's probably less reliable than BCH ECC (we only
have 3 copies :-/).
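
For reference, steps (1) and (2) could look roughly like the sketch
below. It is standalone userspace C, not a patch: onfi_crc16() and
ONFI_CRC_BASE mirror the existing kernel helper, while
onfi_page_crc_ok(), onfi_bit_majority() and onfi_recover_param_page()
are names made up for this example. It assumes exactly 3 copies of 256
bytes each, with the little-endian CRC in the last 2 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONFI_CRC_BASE   0x4F4E
#define ONFI_PARAM_SIZE 256

/* Same algorithm as the kernel helper: polynomial 0x8005, MSB first. */
uint16_t onfi_crc16(uint16_t crc, const uint8_t *p, size_t len)
{
	int i;

	while (len--) {
		crc ^= (uint16_t)(*p++ << 8);
		for (i = 0; i < 8; i++)
			crc = (uint16_t)((crc << 1) ^
					 ((crc & 0x8000) ? 0x8005 : 0));
	}
	return crc;
}

/* The CRC covers bytes 0-253 and is stored little-endian in bytes 254-255. */
int onfi_page_crc_ok(const uint8_t *page)
{
	uint16_t stored = (uint16_t)(page[254] | (page[255] << 8));

	return onfi_crc16(ONFI_CRC_BASE, page, 254) == stored;
}

/* Each output bit is the value seen in at least 2 of the 3 inputs. */
void onfi_bit_majority(const uint8_t *a, const uint8_t *b, const uint8_t *c,
		       uint8_t *out, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		out[i] = (uint8_t)((a[i] & b[i]) | (a[i] & c[i]) |
				   (b[i] & c[i]));
}

/*
 * (1) use the first copy whose CRC passes;
 * (2) otherwise build the bit-majority page and accept it only if its
 *     CRC passes.
 * Returns 0 on success, -1 if nothing usable could be reconstructed.
 */
int onfi_recover_param_page(uint8_t copies[3][ONFI_PARAM_SIZE], uint8_t *out)
{
	int i;

	assert(copies && out);
	for (i = 0; i < 3; i++) {
		if (onfi_page_crc_ok(copies[i])) {
			memcpy(out, copies[i], ONFI_PARAM_SIZE);
			return 0;
		}
	}

	onfi_bit_majority(copies[0], copies[1], copies[2], out,
			  ONFI_PARAM_SIZE);
	return onfi_page_crc_ok(out) ? 0 : -1;
}
```

This recovers the page as long as no bit position is flipped in more
than one copy. As soon as the same bit is wrong in two copies the
majority is wrong too, and all the CRC can do is reject the result.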

Best Regards,

Boris

--
Boris Brezillon, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com