Re: DMA cache consistency bug introduced in 2.6.28 (Was: Re: [Fdutils]Cannot format floppies under kernel 2.6.*?)

From: Alain Knaff
Date: Thu Dec 17 2009 - 17:11:46 EST


Linus Torvalds wrote:
>
> On Thu, 17 Dec 2009, Alain Knaff wrote:
>> For the moment, I have a very small sample of hardware:
>> 1. One machine which works (my own): Athlon XP 1800+ processor
>> 2. One which doesn't work (Mark's)
>
> Ok. I don't think I even have any machines with floppy drives any more
> (one external USB drive somewhere gathering dust just in case I ever
> encounter a floppy again).

Well, on my new box, I have no floppy drive either. The one I mentioned
is an old machine that I kept around just in case I needed to debug
floppy-related problems.

>> I might get access to a wider sample of boxen in a week or so, in order
>> to do some stats.
>
> Ok, I was more thinking "we have a bugzilla with ten different people
> reporting this". If it's just a single machine, that's not going to be
> relevant.

We do have a bugzilla entry,
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=548434 , but
unfortunately only 2 people have reported the bug there so far, and one
of them (ael) turned out to be a false alarm (dusty drive).

>
>> What's the easiest way to find out the chipset?
>>
>> Here's already the output of lspci from my machine (works):
>>
>> 00:00.0 Host bridge: VIA Technologies, Inc. VT8377 [KT400/KT600 AGP] Host Bridge
>> 00:01.0 PCI bridge: VIA Technologies, Inc. VT8235 PCI Bridge
>> 00:11.0 ISA bridge: VIA Technologies, Inc. VT8235 ISA Bridge
>
> Yeah, lspci (and generally only the northbridge and southbridge matters,
> the "ISA bridge" might technically be relevant, but since it's universally
> on the same die as the southbridge, I left it in there just for kicks).
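
For collecting the same three lines from several boxes, a small filter
helps. This is only a sketch: the function name "bridges" is made up
here, and it assumes lspci prints the device class names "Host bridge",
"PCI bridge" and "ISA bridge" exactly as in the output quoted above.

```shell
#!/bin/sh
# bridges: hypothetical helper, keeps only the bridge lines of lspci
# output read from stdin (northbridge, PCI bridge, ISA/LPC bridge).
bridges() {
    grep -E 'Host bridge|PCI bridge|ISA bridge'
}
```

On a machine with pciutils installed it would be used as
"lspci | bridges".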

Good. Here's some info about some of Mark's machines that do have the
problem (there's more than one, fortunately):

1st one showing the problem (claimed to be AMD 790x chipset):

00:00.0 Host bridge: ATI Technologies Inc RD790 Northbridge only dual slot PCI-e_GFX and HT3 K8 part
00:02.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (external gfx0 port A)
00:14.3 ISA bridge: ATI Technologies Inc SB700/SB800 LPC host controller

2nd one showing the problem (also claimed to be AMD 790x chipset):

00:00.0 Host bridge: Advanced Micro Devices [AMD] RS780 Host Bridge
00:01.0 PCI bridge: Advanced Micro Devices [AMD] RS780 PCI to PCI bridge (int gfx)
00:14.3 ISA bridge: ATI Technologies Inc SB700/SB800 LPC host controller

He also has several machines that do work:

1st one that does work:
00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)

... and a couple more that he didn't get around to testing.

[...]
> Only the "it doesn't work on xyz" is likely interesting. The machines it
> works on are probably uninteresting statistically.

I understand... (working machine above just mentioned for completeness'
sake).

[...]
> You'd need a git tree that contains both the working and non-working
> versions, and then literally just do
>
> git bisect start
> git bisect good <known good version number here>
> git bisect bad <known bad version here>
>
> and it will give you a commit to try. Compile, test, see if it's good or
> bad, and do
>
> git bisect [good|bad]
>
> depending on the result. Rinse and repeat (depending on how tight the
> initial good/bad commits were, it will need 10-15 kernel tests).

... and how do I check out the most recent good / oldest bad kernel for
compilation?


> So in this case, since apparently 2.6.27.41 is good, and 2.6.28 is not, it
> would be something like this:
>
> # clone hpa's tree that has all the stable releases in one place
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/hpa/linux-2.6-allstable.git
>
> cd linux-2.6-allstable
> git bisect start
> git bisect bad v2.6.28
> git bisect good v2.6.27.41
>
> and off you go.

ok...

> NOTE! Bisection depends very much on the bug being 100% reproducible. If
> you ever mark a good kernel bad (because you messed up) or a bad kernel
> good (because the bug wasn't 100% reproducible, so you _thought_ it was
> good even though the bug was present and just happened to hide), the end
> result of the bisect will be totally unreliable and seriously screwed up.
>
> So after a successful bisect, it is usually a good idea to try to go back
> to the original known-bad kernel, and then revert the commit that was
> indicated as the bad one (assuming the revert works - it could be that the
> bad one ends up being fundamental to other commits after it), and test
> that yes, that really fixes the bug.

What command lines would I use for that revert?
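
One plausible sequence, as a sketch only: branch off the known-bad
release, revert the suspect commit on top of it, and back out if the
revert does not apply. The function name "try_revert" and the branch
name "revert-test" are made up; the commit id is whatever the bisect
reported.

```shell
#!/bin/sh
# try_revert BASE SUSPECT
# Hypothetical helper: create branch "revert-test" at BASE (e.g. v2.6.28)
# and revert SUSPECT (the commit that git bisect pointed at) on top of it.
try_revert() {
    base=$1
    suspect=$2
    git checkout -q -b revert-test "$base" || return 1
    if git revert --no-edit "$suspect" >/dev/null 2>&1; then
        echo "reverted cleanly; rebuild and retest"
    else
        # Later commits depend on the suspect one: undo the half-done revert.
        git revert --abort
        echo "revert does not apply cleanly"
        return 1
    fi
}
```

It would be run after "git bisect reset", for example as
"try_revert v2.6.28 <commit id reported by the bisect>", followed by a
rebuild and the usual floppy test.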

> It gets more complicated if the bisect hits kernels that you can't test
> because they have _unrelated_ issues on that machine (compile failures or
> just other bugs that hide the actual floppy behavior), but generally
> bisection is pretty simple. "man git-bisect" does have some extra
> pointers.
>
> So git bisect may be somewhat time-consuming and mindless, but for
> reliably triggering bugs where nobody really knows what caused the bug it
> is a _really_ convenient thing to do. The only thing you need is a
> reliably triggering test-case, and some time.
>
> Linus
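
For the untestable-kernel case mentioned above, the manual command is
"git bisect skip". Where a test can be scripted, "git bisect run" treats
exit code 125 as "skip this commit". A minimal sketch of such a wrapper
follows; the function name "bisect_step" and its two arguments are
hypothetical, and the build and test commands are left to the caller.

```shell
#!/bin/sh
# bisect_step BUILD_CMD TEST_CMD
# Hypothetical wrapper for use with "git bisect run":
#   build fails -> 125 (git bisect run skips the commit)
#   test fails  -> 1   (commit is bad)
#   both pass   -> 0   (commit is good)
bisect_step() {
    build_cmd=$1
    test_cmd=$2
    sh -c "$build_cmd" || return 125
    sh -c "$test_cmd" || return 1
    return 0
}
```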

Alain