Re: Triton DMA

Rogier Wolff (R.E.Wolff@BitWizard.nl)
Sat, 29 Nov 1997 22:32:19 +0100 (MET)


Gerard Roudier wrote:
>
>
> On Sat, 29 Nov 1997, Rogier Wolff wrote:
>
> > Gerard Roudier wrote:
> > >
> > >
> > > On Sat, 29 Nov 1997, Rogier Wolff wrote:
> > > >
> > > > In the beginning, parity was considered reasonable: Measurements
> > > > showed that say only one in a million bits went wrong. In that
> > > > situation, using parity is not that bad: there is just a 1 chance in a
> > > > million that a second bit error occurs in that same byte. You
> > > > possibly miss just one error in a million, 999,999 are flagged
> > > > correctly. This means that 1 in 1.25e11 bytes is incorrectly flagged
> > > > as correct while in reality it is wrong.
> > > >
> > > > However nowadays we know, that it doesn't always work like that. You
> > > > might have a 1 in 500,000 chance of a BYTE going wrong, with an
> > > > average of 4 bits wrong in that byte (i.e. the byte is completely
> > > > random). Still just one bit in a million is wrong, but a completely
> > > > random byte has a 50/50 chance of getting the right parity by
> > > > accident. So now you're getting 1 byte in 1e6 bytes flagged
> > > > as correct while in reality it is wrong.
> > > >
> > > > In the first case, you get one error per day of full-time copying. In
> > > > the second case you get 5 errors per second. (Assuming 5Mb per
> > > > second).
> >
> > > Thanks for the explanation.
> > > Btw, I did not observe that my Ultra Wide SCSI BUS ever got 40
> > > undetected errors per second. But I read that IDE DMA may corrupt
> > > data and noticed that PIO is often recommended against DMA.
> >
> > Stop! I just took "one bit in a million" as an example. The real
> > rate may be 1000 or 1000000 times lower, leading to error rates that
> > are a little more believable.
>
> You didn't write it was an example. Did I miss something?

Yep. (Don't take offence).

> > > > showed that say only one in a million bits went wrong. In that
                    ^^^

Yeah. Come to reread it, the "measurements" part may make it look like
scientific data. But you didn't think that I could tell you the error
rate on YOUR SCSI cable in one simple number without sophisticated
equipment near the thing, did you?
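
For what it's worth, the arithmetic behind that example can be sketched
as below. The one-in-a-million bit error rate is the made-up example
figure, and the 5e6 bytes per second transfer rate is my reading of
"5Mb per second"; both are assumptions, not measurements. The second
case takes one fully random byte per 500,000 bytes, which is what "one
bit in a million" works out to when a bad byte has 4 wrong bits on
average.

```python
# Illustrative numbers only: the bit error rate is an example figure,
# not a measurement, and the transfer rate is an assumption.
BIT_ERROR_RATE = 1e-6
BYTES_PER_SECOND = 5e6


def independent_bit_errors(p=BIT_ERROR_RATE):
    """Case 1: bits fail independently.

    Parity misses a byte when two of its bits are wrong. This uses the
    rough estimate from the text: 8*p for a first bad bit in the byte,
    times p for a second one landing in that same byte.
    """
    return 8 * p * p


def random_byte_errors(p=BIT_ERROR_RATE, wrong_bits_per_bad_byte=4):
    """Case 2: whole bytes are replaced by random data.

    A random byte has 4 of its 8 bits wrong on average, so the bad-byte
    rate that gives the same overall bit error rate is 8*p / 4. Half of
    those random bytes carry the right parity by accident.
    """
    bad_byte_rate = 8 * p / wrong_bits_per_bad_byte
    return bad_byte_rate * 0.5


case1 = independent_bit_errors()
case2 = random_byte_errors()

hours = 1 / (case1 * BYTES_PER_SECOND) / 3600
print(f"case 1: one missed byte per {1 / case1:.3g} bytes, "
      f"one every {hours:.1f} hours")
print(f"case 2: one missed byte per {1 / case2:.3g} bytes, "
      f"{case2 * BYTES_PER_SECOND:.1f} per second")
```

With these inputs the first case comes out at one missed byte per
1.25e11 bytes, several hours of full-time copying (the same order of
magnitude as "one per day"), and the second at about 5 per second.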

> > Incorrectly terminated SCSI busses or too long a SCSI bus lead to
> > erratic behaviour. Same (cable too long, or improper termination)
> > goes for IDE. (From the 16Mb/sec mode upwards, the motherboard side
> > of the cable has to be terminated.)
> >
> > > My opinion about IDE BUS is that it is not a suitable IO bus for mass
> > > storage devices, but looks like some extension of some system bus, since
> > it is.
> > > it is neither terminated, nor uses differential signals.
> > > If I enjoyed driving trabants with race car engine, I would probably
> > > use IDE Ultra 33 devices.
> >
> > Gerard, you do need to realize that a 32bit CRC detects all single
> > byte, double byte, and triple byte errors. It detects (2^32)-1 out of
> > 2^32 of all quad byte and longer errors. Up to bursts of 64 bytes of
> > random data, this has a better chance of catching real errors than a
> > per-byte parity.
>
> A 32 bit value is only able to tag 2^32 different bit strings.
> So, IMHO, the efficiency of a CRC depends on its algorithm with regard
> to the typology of error patterns that are likely to occur.
>
> One who wants to convince others that this CRC is superior to something
> else must explain why and not just claim it.
>
> What you write above (e.g. a 32bit CRC detects _all_ ...) can only be
> _false_ in my opinion.
> Note that I think that this CRC is very probably efficient since error
> patterns are not quite random.
> I just want to get explanations about this CRC's capabilities and didn't
> get any for the moment ...

Mark just "admitted" that it is only a 16 bit CRC. So let's start using
this number.

CRCs are based on a Linear Feedback Shift Register (LFSR). With just a
few tricks you can design one whose "cycle time" achieves the
theoretical maximum (65535 for a 16 bit LFSR, as the all-zero state
is excluded). If you base a CRC on this, all perturbations shorter
than the shift register length have to end up with a different value
in the register, as otherwise the LFSR would be back to the beginning
in less than the theoretical maximum.
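
A rough sketch to make this concrete, under loudly stated assumptions:
I don't know which polynomial the drives actually use, so the taps for
the period check (16, 14, 13, 11) and the CRC generator 0x1021
(x^16 + x^12 + x^5 + 1) are just commonly cited 16 bit choices picked
for illustration. The function names are mine. It counts the LFSR cycle
length, then tries every error burst of up to 16 bits at every offset
of a short message, and finally feeds in a 17 bit pattern equal to the
generator itself.

```python
def lfsr_period(seed=0xACE1):
    """Count steps until a 16 bit Fibonacci LFSR returns to its seed.

    Taps 16, 14, 13, 11 are an assumed maximal-length choice.
    """
    state, steps = seed, 0
    while True:
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        steps += 1
        if state == seed:
            return steps


def crc16(value, nbits, poly=0x1021):
    """Bit-serial CRC over the low nbits of value, most significant first.

    One shift and one conditional XOR per input bit, zero initial value.
    The polynomial is an assumption for illustration.
    """
    reg = 0
    for i in range(nbits - 1, -1, -1):
        feedback = ((reg >> 15) ^ (value >> i)) & 1
        reg = (reg << 1) & 0xFFFF
        if feedback:
            reg ^= poly
    return reg


def missed_bursts(nbits=20, max_burst=16):
    """Count bursts of up to max_burst bits that leave the CRC unchanged.

    With a zero initial value the CRC is linear, so the CRC of the error
    pattern alone decides detection, whatever the message was.
    """
    missed = 0
    for length in range(1, max_burst + 1):
        # First and last bit of a burst are flipped; the bits in
        # between may be anything.
        for inner in range(1 << max(length - 2, 0)):
            if length == 1:
                burst = 1
            else:
                burst = (1 << (length - 1)) | (inner << 1) | 1
            for offset in range(nbits - length + 1):
                if crc16(burst << offset, nbits) == 0:
                    missed += 1
    return missed


period = lfsr_period()
missed = missed_bursts()
generator_crc = crc16(0x11021, 17)  # 17 bit error equal to the generator

print("LFSR period:", period)
print("missed bursts of 16 bits or fewer:", missed)
print("CRC of the 17 bit generator pattern:", generator_crc)
```

This should report a period of 65535, no missed bursts of 16 bits or
fewer, and a zero CRC (an undetected error) for the 17 bit pattern.
Longer errors can slip through; short bursts cannot. Note that the
burst check only needs the generator to have a nonzero constant term,
so it holds for this polynomial whether or not its LFSR is
maximal-length.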

> > You're right that an unacceptable overhead would be incurred if
> > software would need to calculate the CRC. As to the speed of
> > calculating a CRC against that of parity, both can be implemented in
> > hardware with just a few xor gates.
>
> I do not have any doc about this CRC.
> Could you let me know what algorithm or polynomial is used?

> > Gerard, may I ask you a question on YOUR field of expertise? I got a
>
> Are you aware that you are using a technology that may encounter a not
> detected error for _each_ byte transmitted with a probability of 1/2?
> :) :) :) :)

Yep, I'm using BOTH IDE and SCSI. I must be nuts :-)

> > With this as the hardware situation, my machine once locked. I would
> > expect SERIOUS failures when my system would be using SCSI as the
> > root-device, but as it is, just a few "large" storage partitions were
> > mounted on the SCSI disks. With bad termination, I'd expect parity
> > errors, timeouts, but not a complete lockup.
>
> Agreed. Normally, even if the NCR chip is locked, the SCSI middle driver
> should ask the driver to reset the controller when a scsi abort request
> times out and all should restart. Something went wrong, perhaps at SCSI
> drivers level (including low-level one), perhaps in some other part of
> the kernel. Error recovery is very hard to test and it is IMO the
> weak point of the current Linux SCSI stack.

Well, :-) May I suggest you take off the termination of your bus :-)
you might get a nice chance to test error recovery within a few
minutes :-) (*)

Roger.

(*) Despite the smileys, I'm half serious.

-- 
** R.E.Wolff@BitWizard.nl ** +31-15-2137555 ** http://www.BitWizard.nl/ **
Florida -- A 39 year old construction worker woke up this morning when a
109-car freight train drove over him. According to the police the man was 
drunk. The man himself claims he slipped while walking the dog. 080897