Re: RFC: starting a kernel-testers group for newbies

From: Arjan van de Ven
Date: Thu May 01 2008 - 08:16:02 EST


On Thu, 1 May 2008 01:13:46 -0700
Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Wed, 30 Apr 2008 00:03:38 -0700 Arjan van de Ven
> <arjan@xxxxxxxxxxxxx> wrote:
>
> > > First of all:
> > > I 100% agree with Andrew that our biggest problems are in
> > > reviewing code and resolving bugs, not in finding bugs (we
> > > already have far too many unresolved bugs).
> >
> > I would argue instead that we don't know which bugs to fix first.
>
> <boggle>
>
> How about "a bug which we just added"? One which is repeatable.
> Repeatable by a tester who is prepared to work with us on resolving
> it. Those bugs.
>
> Rafael has a list of them. We release kernels when that list still
> has tens of unfixed regressions dating back up to a couple of months.
>


I know he does. But I will still argue that if that is all we work from, and we treat
all of those equally, we're doing the wrong thing.
I'm sorry, but I really do not consider "ext4 doesn't compile on m68k", which is
on that list, to be as relevant as an "i915 drm driver crashes" bug, which has been with
us for a while and is not on that list, just based on the total user base for either of those.

Does that mean nobody should fix the m68k bug?
Someone who cares about m68k should for sure work on it, and so should an ext4 developer
if it's easy for them. But if the ext4 person has to spend 8 hours on it figuring out cross
compilers, I say we're doing something very wrong here. (no offense to the m68k people, but
there are just a few of you; maybe I should have picked voyager instead)

Maybe that's a "boggle" for you; but for me that's symptomatic of where we are today:
We don't make (effective) prioritization decisions. Such decisions are hard, because it
effectively means telling people "I'm sorry but your bug is not yet important". That's
unpopular, especially if the reporter is very motivated on lkml. And it will involve a
certain amount of non-quantifiable judgement calls, which also means we won't always be
right. Another hard thing is that lkml is a very self-selective audience. A bug may be
reported three times there, but never hit otherwise, while another bug might not be reported
at all (or only once) while thousands and thousands of people are hitting it.

Not that we're doing all that badly; we ARE fixing the bugs (at least the oopses/warnings) that
are frequently hit. So I wouldn't blindly say we're doing a bad job at prioritizing. I would
rather say that if we focus only on what is left afterwards without doing a reality check,
we'll *always* have a negative view of quality, since there will *always* be bugs we don't
fix. Linux has well over ten million users (much more if you count embedded devices).
A lot of them will have "standard" hardware, and a bunch of them will have "weird" stuff.
Cosmic rays happen. As do overclocking and bad DIMMs. And some BIOSes are just weird etc etc.
If we do not prioritize effectively we'll be stuck forever chasing ghosts, or we'll be stuck
saying "our quality sucks" forever without making progress.

Another trap is to only look at what goes wrong, not at what goes right... we tend to only
see what goes wrong on lkml, and it's easy to fall into doomthinking that way.
Are we doing worse on quality? My (subjective) opinion is that we are doing better than last year.
We are focused more on quality. We are fixing the bugs that people hit most. We are fixing most
of the regressions (yes, not all). Subsystems are seeing flat or lower bugcounts/bugrates. Take ACPI:
the number of outstanding bugs *halved* over the last year. Of course you can pick a single
bug and say "but this one did not get fixed", but that just loses the big picture (and
proves the point :). All of this with a growing userbase and a rate of development that's a bit
faster than last year as well.

Can we do better? Always. More testing will help, both by detecting things early and by
letting us figure out which bugs are important. Just saying "more testing is not relevant
because we're not even fixing the bugs we have now" is just incorrect. Sorry.
More testers help. A wider range of hardware/usages allows us to find better patterns
in the hard to track down bugs. More testers means more people willing to see if they
can diagnose the bugs at least somewhat themselves, via bisection or otherwise. That's important,
because that's the part of the problem that scales well with a growing userbase.